Introduction

This markdown document briefly presents the revised results for MitoImpute. We noticed that HaploGrep2 is able to capture more haplogroups than HiMC, the method of haplogroup assignment we were previously using. Therefore, we have generated results that display the old HiMC outputs as well as the new HaploGrep outputs. Additionally, I have included HaploGrep quality scores and string distances relative to the ‘truth’ haplogroups, which were assigned from the multiple sequence alignment.

Minor allele frequency experiments

This section will detail the minor allele frequency experiments.

## Rows: 387
## Columns: 71
## $ array                                           <fct> BDCHP-1X10-HUMANHAP24…
## $ mcmc                                            <chr> "MCMC1", "MCMC1", "MC…
## $ refpan_maf                                      <ord> MAF1%, MAF1%, MAF1%, …
## $ k_hap                                           <ord> kHAP500, kHAP500, kHA…
## $ imputed                                         <lgl> TRUE, FALSE, FALSE, F…
## $ info_cutoff                                     <dbl> 0.3, NA, NA, NA, NA, …
## $ n_snps_array                                    <dbl> 309, NA, NA, NA, NA, …
## $ n_snps_imputed                                  <dbl> 483, NA, NA, NA, NA, …
## $ n_snps_cutoff_imputed                           <dbl> 467, NA, NA, NA, NA, …
## $ n_type_0                                        <dbl> 181, NA, NA, NA, NA, …
## $ n_type_1                                        <dbl> 0, NA, NA, NA, NA, 0,…
## $ n_type_2                                        <dbl> 229, NA, NA, NA, NA, …
## $ n_type_3                                        <dbl> 73, NA, NA, NA, NA, 4…
## $ n_type_0_cutoff                                 <dbl> 165, NA, NA, NA, NA, …
## $ n_type_1_cutoff                                 <dbl> 0, NA, NA, NA, NA, 0,…
## $ n_type_2_cutoff                                 <dbl> 229, NA, NA, NA, NA, …
## $ n_type_3_cutoff                                 <dbl> 73, NA, NA, NA, NA, 4…
## $ mean_info                                       <dbl> 0.8791739, NA, NA, NA…
## $ mean_info_cutoff                                <dbl> 0.9037966, NA, NA, NA…
## $ mean_maf                                        <dbl> 0.06190269, NA, NA, N…
## $ mean_maf_cutoff                                 <dbl> 0.06381799, NA, NA, N…
## $ mean_mcc                                        <dbl> 0.8179815, NA, NA, NA…
## $ mean_mcc_cutoff                                 <dbl> 0.8727745, NA, NA, NA…
## $ mean_concordance                                <dbl> 0.9958531, NA, NA, NA…
## $ mean_concordance_cutoff                         <dbl> 0.9959055, NA, NA, NA…
## $ mean_certainty                                  <dbl> 0.9973703, NA, NA, NA…
## $ mean_certainty_cutoff                           <dbl> 0.9974721, NA, NA, NA…
## $ mean_himc_concordance_typed                     <dbl> 0.9806553, NA, NA, NA…
## $ mean_himc_concordance_typed_macro               <dbl> 0.9936834, NA, NA, NA…
## $ mean_himc_concordance_imputed                   <dbl> 0.9885511, NA, NA, NA…
## $ mean_himc_concordance_imputed_cutoff            <dbl> 0.9885511, NA, NA, NA…
## $ mean_himc_concordance_imputed_macro             <dbl> 0.9984208, NA, NA, NA…
## $ mean_himc_concordance_imputed_macro_cutoff      <dbl> 0.9984208, NA, NA, NA…
## $ mean_haplogrep_concordance_typed                <dbl> 0.3062352, NA, NA, NA…
## $ mean_haplogrep_concordance_typed_macro          <dbl> 0.9932912, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed              <dbl> 0.2841358, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_cutoff       <dbl> 0.2892660, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_macro        <dbl> 0.9940805, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_macro_cutoff <dbl> 0.9940805, NA, NA, NA…
## $ mean_haplogrep_quality_truth                    <dbl> 0.8560609, NA, NA, NA…
## $ mean_haplogrep_quality_typed                    <dbl> 0.9822484, NA, NA, NA…
## $ mean_haplogrep_quality_imputed                  <dbl> 0.9785349, NA, NA, NA…
## $ mean_haplogrep_quality_imputed_cutoff           <dbl> 0.9789348, NA, NA, NA…
## $ mean_haplogrep_distance_dl_typed                <dbl> 1.865430, NA, NA, NA,…
## $ mean_haplogrep_distance_dl_imputed              <dbl> 2.160616, NA, NA, NA,…
## $ mean_haplogrep_distance_dl_imputed_cutoff       <dbl> 2.123125, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_typed                <dbl> 1.865430, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_imputed              <dbl> 2.160616, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_imputed_cutoff       <dbl> 2.123125, NA, NA, NA,…
## $ mean_haplogrep_distance_jc_typed                <dbl> 0.2800019, NA, NA, NA…
## $ mean_haplogrep_distance_jc_imputed              <dbl> 0.3019733, NA, NA, NA…
## $ mean_haplogrep_distance_jc_imputed_cutoff       <dbl> 0.2982575, NA, NA, NA…
## $ himc_diff                                       <dbl> 0.007895776, NA, NA, …
## $ himc_cutoff_diff                                <dbl> 0.007895776, NA, NA, …
## $ himc_macro_diff                                 <dbl> 0.004737465, NA, NA, …
## $ himc_macro_cutoff_diff                          <dbl> 0.004737465, NA, NA, …
## $ haplogrep_diff                                  <dbl> -0.022099448, NA, NA,…
## $ haplogrep_cutoff_diff                           <dbl> -0.016969219, NA, NA,…
## $ haplogrep_macro_diff                            <dbl> 0.000789266, NA, NA, …
## $ haplogrep_macro_cutoff_diff                     <dbl> 0.000789266, NA, NA, …
## $ haplogrep_quality_diff                          <dbl> -0.003713536, NA, NA,…
## $ haplogrep_quality_cutoff_diff                   <dbl> -0.003313575, NA, NA,…
## $ haplogrep_quality_diff_truth_typed              <dbl> -0.1261875, NA, NA, N…
## $ haplogrep_quality_diff_truth_imputed            <dbl> -0.1224739, NA, NA, N…
## $ haplogrep_quality_diff_truth_imputed_cutoff     <dbl> -0.1228739, NA, NA, N…
## $ haplogrep_distance_dl_diff                      <dbl> 0.2951855, NA, NA, NA…
## $ haplogrep_distance_dl_cutoff_diff               <dbl> 0.2576953, NA, NA, NA…
## $ haplogrep_distance_lv_diff                      <dbl> 0.2951855, NA, NA, NA…
## $ haplogrep_distance_lv_cutoff_diff               <dbl> 0.2576953, NA, NA, NA…
## $ haplogrep_distance_jc_diff                      <dbl> 0.0219713276, NA, NA,…
## $ haplogrep_distance_jc_cutoff_diff               <dbl> 0.018255556, NA, NA, …

HiMC

HiMC Haplogrouping

We previously found that imputing missing variants increased the accuracy of haplogroup assignments when using HiMC to assign haplogroups.
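As a minimal sketch of what haplogroup concordance means in this report, the snippet below scores the fraction of samples whose assigned haplogroup label exactly matches the truth label. The sample IDs and labels are invented for illustration, and treating the first character of a label as its macrohaplogroup is an assumption made only for this example, not necessarily the rule used in the analysis.

```python
def macrohaplogroup(label):
    """Assumption for illustration: the macrohaplogroup is the leading letter."""
    return label[:1]


def concordance(truth, called, key=lambda label: label):
    """Fraction of shared samples whose (optionally transformed) labels match."""
    shared = sorted(truth.keys() & called.keys())
    if not shared:
        raise ValueError("no samples in common between truth and called sets")
    matches = sum(key(truth[s]) == key(called[s]) for s in shared)
    return matches / len(shared)


# Hypothetical labels: one sample is called at a shallower sub-haplogroup.
truth = {"s1": "H1a1", "s2": "U5b", "s3": "L3e1"}
called = {"s1": "H1a", "s2": "U5b", "s3": "L3e1"}

print(concordance(truth, called))                       # exact-label concordance
print(concordance(truth, called, key=macrohaplogroup))  # macrohaplogroup concordance
```

The difference plotted later in this section would then be the concordance of the imputed calls minus the concordance of the genotyped calls, each measured against the same truth set.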

Boxplot of mean haplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HiMC


Compare this result with the imputed data, which shows a higher haplogroup concordance:
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HiMC. No info score filter was applied to the imputed data (sites with info ≤ 0.3 are retained).

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of mean difference in haplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HiMC. No info score filter was applied to the imputed data (sites with info ≤ 0.3 are retained).

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed haplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0029833 | 0.0014916 | 0.0441965 | 0.9567721 |
| Residuals | 304 | 10.2600244 | 0.0337501 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.8600744 | 0.0182800 | 304 | 0.8241030 | 0.8960458 |
| MAF0.5% | 0.8551735 | 0.0181017 | 304 | 0.8195530 | 0.8907939 |
| MAF0.1% | 0.8525313 | 0.0181017 | 304 | 0.8169108 | 0.8881517 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | 0.0049010 | 0.0257261 | 304 | 0.1905065 | 0.9801924 |
| MAF1% - MAF0.1% | 0.0075432 | 0.0257261 | 304 | 0.2932113 | 0.9537218 |
| MAF0.5% - MAF0.1% | 0.0026422 | 0.0255996 | 304 | 0.1032120 | 0.9941443 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0054205 | 0.0027102 | 0.1366648 | 0.8723168 |
| Residuals | 300 | 5.9493818 | 0.0198313 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.3103129 | 0.0140125 | 300 | 0.2827377 | 0.3378881 |
| MAF0.5% | 0.3203330 | 0.0140125 | 300 | 0.2927579 | 0.3479082 |
| MAF0.1% | 0.3176033 | 0.0140125 | 300 | 0.2900282 | 0.3451785 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | -0.0100201 | 0.0198166 | 300 | -0.5056419 | 0.8686475 |
| MAF1% - MAF0.1% | -0.0072904 | 0.0198166 | 300 | -0.3678945 | 0.9281329 |
| MAF0.5% - MAF0.1% | 0.0027297 | 0.0198166 | 300 | 0.1377474 | 0.9895942 |

HiMC Macrohaplogrouping

This trend can also be seen when only macrohaplogroups are considered:
Boxplot of mean macrohaplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HiMC


Compare this result with the imputed data, which shows a higher macrohaplogroup concordance:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HiMC. No info score filter was applied to the imputed data (sites with info ≤ 0.3 are retained).

If the improvement in accurate assignment of macrohaplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of mean difference in macrohaplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HiMC. No info score filter was applied to the imputed data (sites with info ≤ 0.3 are retained).

These can be statistically tested with linear models:

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0059374 | 0.0029687 | 0.0926436 | 0.911544 |
| Residuals | 304 | 9.7415101 | 0.0320444 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.8938627 | 0.0178121 | 304 | 0.8588120 | 0.9289133 |
| MAF0.5% | 0.8883512 | 0.0176383 | 304 | 0.8536425 | 0.9230599 |
| MAF0.1% | 0.8830726 | 0.0176383 | 304 | 0.8483639 | 0.9177813 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | 0.0055115 | 0.0250676 | 304 | 0.2198638 | 0.9737057 |
| MAF1% - MAF0.1% | 0.0107901 | 0.0250676 | 304 | 0.4304400 | 0.9029618 |
| MAF0.5% - MAF0.1% | 0.0052786 | 0.0249444 | 304 | 0.2116161 | 0.9756173 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0043381 | 0.0021691 | 0.0892709 | 0.9146221 |
| Residuals | 300 | 7.2892487 | 0.0242975 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.2363714 | 0.0155103 | 300 | 0.2058486 | 0.2668942 |
| MAF0.5% | 0.2455937 | 0.0155103 | 300 | 0.2150710 | 0.2761165 |
| MAF0.1% | 0.2401832 | 0.0155103 | 300 | 0.2096605 | 0.2707060 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | -0.0092223 | 0.0219349 | 300 | -0.4204415 | 0.9072014 |
| MAF1% - MAF0.1% | -0.0038118 | 0.0219349 | 300 | -0.1737785 | 0.9834902 |
| MAF0.5% - MAF0.1% | 0.0054105 | 0.0219349 | 300 | 0.2466630 | 0.9670199 |

These results suggest that there is no statistically significant difference in accurate assignment of haplogroups or macrohaplogroups between different Reference Panel minor allele frequency filtering thresholds.
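To make the F values in the tables above easier to read, here is a minimal sketch of a one-way analysis of variance, assuming a single categorical predictor (standing in for refpan_maf) and a numeric response. The function name and the toy concordance values are hypothetical; the last line only re-derives one reported F value from its two Mean Sq entries.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (df_between, df_resid, ss_between, ss_resid, f_value).

    groups maps a factor level to the list of responses observed at that level.
    """
    values = [v for g in groups.values() for v in g]
    grand_mean = mean(values)
    # Between-group sum of squares: group size times squared offset of its mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Residual sum of squares: squared deviations from each group's own mean.
    ss_resid = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_resid = len(values) - len(groups)
    f_value = (ss_between / df_between) / (ss_resid / df_resid)
    return df_between, df_resid, ss_between, ss_resid, f_value


# Made-up concordance values per threshold, for illustration only.
toy = {
    "MAF1%": [0.86, 0.84, 0.88],
    "MAF0.5%": [0.85, 0.87, 0.83],
    "MAF0.1%": [0.84, 0.86, 0.85],
}
print(one_way_anova(toy))

# F is the ratio of the two Mean Sq entries; with the rounded values from the
# HiMC imputed haplogroup table this comes out close to the reported 0.0441965.
print(0.0014916 / 0.0337501)
```

An F value this far below 1 means the spread between threshold means is smaller than the spread within a threshold, which is consistent with the large p-values reported for HiMC.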

HaploGrep 2.0

HaploGrep Haplogrouping

We are investigating using HaploGrep 2.0 for assigning haplogroups, as HaploGrep has a greater ability to assign haplogroups that cover all sub-groupings.

Boxplot of mean haplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep


Compare this result with the imputed data:
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. No info score filter was applied to the imputed data (sites with info ≤ 0.3 are retained).

The mean difference shows how imputation changed haplogroup concordance. Unlike with HiMC, the estimated marginal means of this difference are negative for MAF1% and MAF0.5% and positive only for MAF0.1%:
Boxplot of mean difference in haplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. No info score filter was applied to the imputed data (sites with info ≤ 0.3 are retained).

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed haplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0976431 | 0.0488216 | 4.789485 | 0.0089547 |
| Residuals | 304 | 3.0988201 | 0.0101935 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.1672931 | 0.0100462 | 304 | 0.1475243 | 0.1870620 |
| MAF0.5% | 0.1872476 | 0.0099482 | 304 | 0.1676716 | 0.2068236 |
| MAF0.1% | 0.2109831 | 0.0099482 | 304 | 0.1914071 | 0.2305590 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | -0.0199545 | 0.0141383 | 304 | -1.411377 | 0.3362773 |
| MAF1% - MAF0.1% | -0.0436899 | 0.0141383 | 304 | -3.090183 | 0.0061722 |
| MAF0.5% - MAF0.1% | -0.0237355 | 0.0140688 | 304 | -1.687096 | 0.2117763 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.1160262 | 0.0580131 | 163.6211 | < 0.0000001 |
| Residuals | 304 | 0.1077856 | 0.0003546 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | -0.0395844 | 0.0018736 | 304 | -0.0432713 | -0.0358975 |
| MAF0.5% | -0.0156206 | 0.0018553 | 304 | -0.0192715 | -0.0119696 |
| MAF0.1% | 0.0081149 | 0.0018553 | 304 | 0.0044639 | 0.0117658 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | -0.0239639 | 0.0026368 | 304 | -9.088190 | < 0.0000001 |
| MAF1% - MAF0.1% | -0.0476993 | 0.0026368 | 304 | -18.089758 | < 0.0000001 |
| MAF0.5% - MAF0.1% | -0.0237355 | 0.0026239 | 304 | -9.046021 | < 0.0000001 |
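
The contrast rows above can be reproduced from the estimated marginal means by hand. The sketch below assumes a one-way model in which the group means are independent, so the standard error of a difference is the root sum of squares of the two standard errors. The function name is hypothetical, and the p-value adjustment for multiple comparisons is not reproduced here.

```python
from math import sqrt


def pairwise_contrast(emmean_a, se_a, emmean_b, se_b):
    """Return (estimate, se, t_ratio) for the contrast a - b.

    Assumes the two estimated means are uncorrelated, as in a one-way model.
    """
    estimate = emmean_a - emmean_b
    se = sqrt(se_a ** 2 + se_b ** 2)
    return estimate, se, estimate / se


# MAF1% versus MAF0.1% for imputed HaploGrep haplogroup concordance,
# using the rounded emmean and SE values reported in the table above.
estimate, se, t_ratio = pairwise_contrast(0.1672931, 0.0100462, 0.2109831, 0.0099482)
print(estimate, se, t_ratio)
```

With these inputs the result agrees with the reported MAF1% - MAF0.1% row (estimate -0.0436899, SE 0.0141383, t.ratio -3.090183) up to rounding of the inputs.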

HaploGrep Macrohaplogrouping

The results when only macrohaplogroups are considered are shown below:
Boxplot of mean macrohaplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep


Compare this result with the imputed data, which shows a slightly higher macrohaplogroup concordance:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. No info score filter was applied to the imputed data (sites with info ≤ 0.3 are retained).

The mean difference between the genotyped and imputed data makes the size of this change clearer:
Boxplot of mean difference in macrohaplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. No info score filter was applied to the imputed data (sites with info ≤ 0.3 are retained).

These can be statistically tested with linear models:

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0012794 | 0.0006397 | 0.019601 | 0.9805911 |
| Residuals | 304 | 9.9213982 | 0.0326362 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.8842553 | 0.0179758 | 304 | 0.8488825 | 0.9196281 |
| MAF0.5% | 0.8792615 | 0.0178005 | 304 | 0.8442338 | 0.9142892 |
| MAF0.1% | 0.8813994 | 0.0178005 | 304 | 0.8463717 | 0.9164271 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | 0.0049939 | 0.0252980 | 304 | 0.1974015 | 0.9787485 |
| MAF1% - MAF0.1% | 0.0028559 | 0.0252980 | 304 | 0.1128921 | 0.9929985 |
| MAF0.5% - MAF0.1% | -0.0021379 | 0.0251736 | 304 | -0.0849267 | 0.9960315 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0085920 | 0.0042960 | 2.340219 | 0.0980394 |
| Residuals | 304 | 0.5580574 | 0.0018357 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.0021255 | 0.0042633 | 304 | -0.0062637 | 0.0105148 |
| MAF0.5% | 0.0121608 | 0.0042217 | 304 | 0.0038534 | 0.0204682 |
| MAF0.1% | 0.0142987 | 0.0042217 | 304 | 0.0059914 | 0.0226061 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | -0.0100353 | 0.0059998 | 304 | -1.6725957 | 0.2174178 |
| MAF1% - MAF0.1% | -0.0121732 | 0.0059998 | 304 | -2.0289253 | 0.1070455 |
| MAF0.5% - MAF0.1% | -0.0021379 | 0.0059703 | 304 | -0.3580893 | 0.9317786 |

HaploGrep Haplogrouping (with info > 0.3 cutoff)

It should be noted that, by convention, imputed variants with an IMPUTE2 info score ≤ 0.3 are excluded from the final datasets. As such, I have also displayed these results after excluding any imputed sites with an info score ≤ 0.3.
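
The cutoff can be sketched as a simple filter, assuming each site is a record with a flag for whether it was imputed and, for imputed sites, its info score. The field names and the example records are hypothetical and do not reflect the actual IMPUTE2 output layout.

```python
INFO_CUTOFF = 0.3


def apply_info_cutoff(sites, cutoff=INFO_CUTOFF):
    """Keep genotyped sites, and imputed sites with info strictly above the cutoff."""
    return [s for s in sites if not s["imputed"] or s["info"] > cutoff]


# Hypothetical records: one genotyped site and three imputed sites.
sites = [
    {"pos": 73, "imputed": False, "info": None},
    {"pos": 150, "imputed": True, "info": 0.30},
    {"pos": 263, "imputed": True, "info": 0.31},
    {"pos": 750, "imputed": True, "info": 0.95},
]

kept = apply_info_cutoff(sites)
print([s["pos"] for s in kept])
```

Note that a site with info exactly equal to 0.3 is dropped, matching the "info ≤ 0.3 are excluded" wording.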

Imputed haplogroup concordance (info > 0.3 cutoff applied):
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

Difference in haplogroup concordance between the genotyped and imputed datasets (info > 0.3 cutoff applied):
Boxplot of mean difference in haplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed haplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0254557 | 0.0127278 | 1.252717 | 0.2872478 |
| Residuals | 294 | 2.9870905 | 0.0101602 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.1822736 | 0.0100297 | 294 | 0.1625344 | 0.2020127 |
| MAF0.5% | 0.1972935 | 0.0099319 | 294 | 0.1777469 | 0.2168401 |
| MAF0.1% | 0.2046356 | 0.0104522 | 294 | 0.1840649 | 0.2252062 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed haplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | -0.0150200 | 0.0141152 | 294 | -1.0640995 | 0.5371477 |
| MAF1% - MAF0.1% | -0.0223620 | 0.0144860 | 294 | -1.5436953 | 0.2720877 |
| MAF0.5% - MAF0.1% | -0.0073421 | 0.0144184 | 294 | -0.5092128 | 0.8669185 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0739095 | 0.0369548 | 129.5151 | < 0.0000001 |
| Residuals | 294 | 0.0838876 | 0.0002853 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | -0.0246040 | 0.0016808 | 294 | -0.0279119 | -0.0212961 |
| MAF0.5% | -0.0055747 | 0.0016644 | 294 | -0.0088503 | -0.0022990 |
| MAF0.1% | 0.0144649 | 0.0017516 | 294 | 0.0110176 | 0.0179122 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | -0.0190293 | 0.0023654 | 294 | -8.044750 | < 0.0000001 |
| MAF1% - MAF0.1% | -0.0390689 | 0.0024276 | 294 | -16.093751 | < 0.0000001 |
| MAF0.5% - MAF0.1% | -0.0200396 | 0.0024163 | 294 | -8.293641 | < 0.0000001 |

HaploGrep Macrohaplogrouping (with info > 0.3 cutoff)

The same comparison is shown below for macrohaplogroups only. The genotyped result is unchanged by the cutoff, so only the imputed data are shown:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

The mean difference between the genotyped and imputed data makes the size of the change clearer:
Boxplot of mean difference in macrohaplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

These can be statistically tested with linear models:

Table showing the analysis of variance for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0063379 | 0.0031690 | 0.1089308 | 0.8968286 |
| Residuals | 294 | 8.5529105 | 0.0290915 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.8900459 | 0.0169716 | 294 | 0.8566447 | 0.9234471 |
| MAF0.5% | 0.8839626 | 0.0168060 | 294 | 0.8508872 | 0.9170379 |
| MAF0.1% | 0.8786272 | 0.0176865 | 294 | 0.8438190 | 0.9134354 |

Table showing the contrasts for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | 0.0060833 | 0.0238847 | 294 | 0.2546947 | 0.9648766 |
| MAF1% - MAF0.1% | 0.0114186 | 0.0245122 | 294 | 0.4658356 | 0.8873319 |
| MAF0.5% - MAF0.1% | 0.0053354 | 0.0243978 | 294 | 0.2186813 | 0.9739841 |

Table showing the analysis of variance for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data

| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
| --- | --- | --- | --- | --- | --- |
| refpan_maf | 2 | 0.0098631 | 0.0049316 | 2.410029 | 0.0915852 |
| Residuals | 294 | 0.6016025 | 0.0020463 | NA | NA |

Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
| --- | --- | --- | --- | --- | --- |
| MAF1% | 0.0079161 | 0.0045011 | 294 | -0.0009424 | 0.0167746 |
| MAF0.5% | 0.0168619 | 0.0044572 | 294 | 0.0080899 | 0.0256340 |
| MAF0.1% | 0.0219469 | 0.0046907 | 294 | 0.0127152 | 0.0311785 |

Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel minor allele frequency filtering thresholds

| contrast | estimate | SE | df | t.ratio | p.value |
| --- | --- | --- | --- | --- | --- |
| MAF1% - MAF0.5% | -0.0089458 | 0.0063346 | 294 | -1.4122253 | 0.3358850 |
| MAF1% - MAF0.1% | -0.0140308 | 0.0065010 | 294 | -2.1582526 | 0.0802339 |
| MAF0.5% - MAF0.1% | -0.0050850 | 0.0064707 | 294 | -0.7858469 | 0.7120262 |

These results suggest that there is a statistically significant difference in the accurate assignment of haplogroups between different Reference Panel minor allele frequency filtering thresholds. However, the differences are small (every estimated contrast between thresholds is below 0.05 in absolute concordance), so their biological and practical significance appears limited.

These results suggest that there is no statistically significant difference in accurate assignment of macrohaplogroups between different Reference Panel minor allele frequency filtering thresholds. However, it should be noted that both the genotyped and imputed datasets allow HaploGrep to accurately call macrohaplogroups, with average accuracy in the high 80%s.

There is a slight increase in the ability to accurately call haplogroups when a filter of info > 0.3 is applied, but the biological and practical significance of the improvement again seems small.

HaploGrep haplogroup quality comparisons

We also examined the difference in HaploGrep’s quality score between the truth set, genotyped set, and imputed set.

Here I show the difference between the truth set and the genotyped set:
Boxplot of the mean difference in HaploGrep quality score between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep.

Here I show the difference between the truth set and the imputed set:
Boxplot of the mean difference in HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. No info score filter was applied to the imputed data (sites with info ≤ 0.3 are retained).

Here I show the difference between the truth set and the imputed set with the info score filter info > 0.3:
Boxplot of the mean difference in HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed sites with info ≤ 0.3 were removed.

Here it appears that relative to the truth set, the quality is still decreased.

However, I have also investigated the difference between the genotyped and imputed datasets to see if there is any improvement. I have only investigated the imputed dataset filtered with info > 0.3.
Boxplot of mean HaploGrep quality score between the genotyped data and the imputed data. Haplogroups were assigned using HaploGrep. The imputed data include a filter to remove imputed data points with info ≤ 0.3.

On average, there is a decrease in HaploGrep quality score.

HaploGrep string distance (Damerau-Levenshtein)

We also examined the string distance between assigned haplogroup labels, because exact-match measures of haplogroup concordance can be misleading when only a sub-haplogroup is incorrectly assigned. We used several distance measures, as different measures can give different results. All results compare the genotyped dataset with the imputed dataset filtered at info > 0.3.
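To illustrate how edit distances behave on haplogroup labels, here is a minimal Python sketch (the report itself was generated in R). The labels are hypothetical, and the second function implements the restricted "optimal string alignment" variant of Damerau-Levenshtein, which may differ from the exact variant used for the results below.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein: Levenshtein plus swaps of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition counts as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


# Hypothetical (truth, assigned) haplogroup labels, not taken from the data
pairs = [("H1a1", "H1a1"), ("H1a1", "H1a3"), ("U5b2", "U5b"), ("T2b", "Tb2")]
for truth, called in pairs:
    print(truth, called, levenshtein(truth, called), osa_distance(truth, called))
```

The two measures differ only when adjacent characters are swapped (the last pair), which is rare in haplogroup labels. This is consistent with the near-identical Damerau-Levenshtein and Levenshtein results reported in this document.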

This result shows the Damerau-Levenshtein distance:
Boxplot of mean Damerau-Levenshtein string distance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data include a filter to remove imputed data points with info ≤ 0.3.

ANOVA table for the linear model testing for a significant difference in the Damerau-Levenshtein string distance between assigned haplogroups across Reference Panel minor allele frequency filtering thresholds
term Df Sum.Sq Mean.Sq F.value Pr(>F)
refpan_maf 2 4.239264 2.1196323 15.10913 6e-07
Residuals 294 41.244733 0.1402882 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean Damerau-Levenshtein string distance between assigned haplogroups for different Reference Panel minor allele frequency filtering thresholds
refpan_maf emmean SE df lower.CL upper.CL
MAF1% 0.3900615 0.0372692 294 0.3167133 0.4634097
MAF0.5% 0.1255738 0.0369056 294 0.0529412 0.1982063
MAF0.1% 0.1539571 0.0388391 294 0.0775192 0.2303950
Table showing the pairwise contrasts for the linear model testing for a significant difference in the mean Damerau-Levenshtein string distance between assigned haplogroups for different Reference Panel minor allele frequency filtering thresholds
contrast estimate SE df t.ratio p.value
MAF1% - MAF0.5% 0.2644877 0.0524501 294 5.0426543 0.0000024
MAF1% - MAF0.1% 0.2361044 0.0538281 294 4.3862644 0.0000477
MAF0.5% - MAF0.1% -0.0283833 0.0535770 294 -0.5297671 0.8567930
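
As a reading aid, the columns of the first and third tables above are related by simple arithmetic. The Python sketch below recomputes the F value and one t ratio from the reported Damerau-Levenshtein numbers; the variable names are illustrative, and the p-values are not recomputed.

```python
from math import comb

# Values copied from the Damerau-Levenshtein tables above
ss_term, df_term = 4.239264, 2        # refpan_maf row: Sum Sq, Df
ss_resid, df_resid = 41.244733, 294   # Residuals row: Sum Sq, Df

ms_term = ss_term / df_term           # Mean Sq = Sum Sq / Df
ms_resid = ss_resid / df_resid
f_value = ms_term / ms_resid          # F value = Mean Sq (term) / Mean Sq (residuals)

# First contrast row: MAF1% - MAF0.5%
estimate, se = 0.2644877, 0.0524501
t_ratio = estimate / se               # t.ratio = estimate / SE

# Number of pairwise contrasts among the three MAF thresholds
n_contrasts = comb(3, 2)

print(round(f_value, 4), round(t_ratio, 4), n_contrasts)
```

The same relationships hold for every model table in this document. With nine k_hap levels, the number of pairwise contrasts is `comb(9, 2)`, i.e. 36.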

HaploGrep string distance (Levenshtein)

We also examined the string distance between assigned haplogroup labels, because exact-match measures of haplogroup concordance can be misleading when only a sub-haplogroup is incorrectly assigned. We used several distance measures, as different measures can give different results. All results compare the genotyped dataset with the imputed dataset filtered at info > 0.3.

This result shows the Levenshtein distance:
Boxplot of mean Levenshtein string distance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data include a filter to remove imputed data points with info ≤ 0.3.

ANOVA table for the linear model testing for a significant difference in the Levenshtein string distance between assigned haplogroups across Reference Panel minor allele frequency filtering thresholds
term Df Sum.Sq Mean.Sq F.value Pr(>F)
refpan_maf 2 4.240671 2.1203355 15.11616 6e-07
Residuals 294 41.239223 0.1402695 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean Levenshtein string distance between assigned haplogroups for different Reference Panel minor allele frequency filtering thresholds
refpan_maf emmean SE df lower.CL upper.CL
MAF1% 0.3899951 0.0372667 294 0.3166518 0.4633384
MAF0.5% 0.1254856 0.0369031 294 0.0528579 0.1981134
MAF0.1% 0.1538171 0.0388365 294 0.0773843 0.2302498
Table showing the pairwise contrasts for the linear model testing for a significant difference in the mean Levenshtein string distance between assigned haplogroups for different Reference Panel minor allele frequency filtering thresholds
contrast estimate SE df t.ratio p.value
MAF1% - MAF0.5% 0.2645094 0.0524466 294 5.0434049 0.0000024
MAF1% - MAF0.1% 0.2361780 0.0538245 294 4.3879250 0.0000474
MAF0.5% - MAF0.1% -0.0283314 0.0535734 294 -0.5288336 0.8572589

HaploGrep string distance (Jaccard)

We also examined the string distance between assigned haplogroup labels, because exact-match measures of haplogroup concordance can be misleading when only a sub-haplogroup is incorrectly assigned. We used several distance measures, as different measures can give different results. All results compare the genotyped dataset with the imputed dataset filtered at info > 0.3.
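Unlike the edit distances, the Jaccard distance is bounded between 0 and 1 and ignores character order. The Python sketch below assumes the distance is taken over sets of single characters (q-grams with q = 1); the q-gram size actually used for the results is not stated here, and the labels are hypothetical.

```python
def jaccard_distance(a: str, b: str, q: int = 1) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the sets of q-grams of each string."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    union = grams_a | grams_b
    if not union:
        return 0.0  # two strings without q-grams are treated as identical
    return 1.0 - len(grams_a & grams_b) / len(union)


# Hypothetical (truth, assigned) haplogroup labels
for truth, called in [("H1a1", "H1a3"), ("T2b", "Tb2"), ("H1a1", "U5b2")]:
    print(truth, called, jaccard_distance(truth, called))
```

Because the measure lies in [0, 1], the Jaccard estimates are expected to sit on a much smaller scale than the edit-distance estimates, so the two are not directly comparable in magnitude.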

This result shows the Jaccard distance:
Boxplot of mean Jaccard string distance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data include a filter to remove imputed data points with info ≤ 0.3.

ANOVA table for the linear model testing for a significant difference in the Jaccard string distance between assigned haplogroups across Reference Panel minor allele frequency filtering thresholds
term Df Sum.Sq Mean.Sq F.value Pr(>F)
refpan_maf 2 0.1391510 0.0695755 278.8615 0
Residuals 294 0.0733525 0.0002495 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean Jaccard string distance between assigned haplogroups for different Reference Panel minor allele frequency filtering thresholds
refpan_maf emmean SE df lower.CL upper.CL
MAF1% 0.0261476 0.0015717 294 0.0230544 0.0292409
MAF0.5% -0.0076583 0.0015564 294 -0.0107214 -0.0045952
MAF0.1% -0.0265022 0.0016379 294 -0.0297257 -0.0232787
Table showing the pairwise contrasts for the linear model testing for a significant difference in the mean Jaccard string distance between assigned haplogroups for different Reference Panel minor allele frequency filtering thresholds
contrast estimate SE df t.ratio p.value
MAF1% - MAF0.5% 0.0338059 0.0022119 294 15.283508 0
MAF1% - MAF0.1% 0.0526498 0.0022700 294 23.193401 0
MAF0.5% - MAF0.1% 0.0188439 0.0022594 294 8.340065 0

Matthews Correlation Coefficient (MCC)

We also determined imputation accuracy using the Matthews correlation coefficient (MCC). The MCC is a more direct measure of the imputation accuracy of genotypes (as opposed to haplotypes).
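
For reference, the MCC for a single variant can be sketched from the confusion counts between true and imputed alleles. The Python example below assumes alleles are coded 0 (reference) and 1 (alternate) per sample and returns 0 when the coefficient is undefined; both conventions are assumptions for illustration, not necessarily those used for the results.

```python
from math import sqrt


def mcc(truth, imputed):
    """Matthews correlation coefficient for two equal-length 0/1 vectors."""
    tp = sum(1 for t, p in zip(truth, imputed) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, imputed) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, imputed) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, imputed) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def concordance(truth, imputed):
    """Fraction of samples where the imputed allele equals the true allele."""
    return sum(1 for t, p in zip(truth, imputed) if t == p) / len(truth)


# Hypothetical rare variant: 1 carrier in 100 samples, imputed as all-reference
truth = [0] * 99 + [1]
imputed = [0] * 100
print(concordance(truth, imputed), mcc(truth, imputed))
```

In this example concordance is 0.99 while the MCC is 0, which illustrates why mean concordance can sit near 1 while the mean MCC is noticeably lower when minor alleles are rare.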

Boxplot of mean Matthews correlation coefficient between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data include a filter to remove imputed data points with info ≤ 0.3.

ANOVA table for the linear model testing for a significant difference in the Matthews correlation coefficient across Reference Panel minor allele frequency filtering thresholds
term Df Sum.Sq Mean.Sq F.value Pr(>F)
refpan_maf 2 1.928274 0.9641368 123.0775 0
Residuals 304 2.381407 0.0078336 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean Matthews correlation coefficient for different Reference Panel minor allele frequency filtering thresholds
refpan_maf emmean SE df lower.CL upper.CL
MAF1% 0.8667414 0.0088068 304 0.8494113 0.8840714
MAF0.5% 0.7584052 0.0087209 304 0.7412442 0.7755662
MAF0.1% 0.6726554 0.0087209 304 0.6554944 0.6898164
Table showing the pairwise contrasts for the linear model testing for a significant difference in the mean Matthews correlation coefficient for different Reference Panel minor allele frequency filtering thresholds
contrast estimate SE df t.ratio p.value
MAF1% - MAF0.5% 0.1083362 0.0123941 304 8.740931 0
MAF1% - MAF0.1% 0.1940860 0.0123941 304 15.659517 0
MAF0.5% - MAF0.1% 0.0857498 0.0123332 304 6.952752 0

IMPUTE2 INFO Score

We also report IMPUTE2’s INFO score. Here I plot INFO scores for both the raw imputed data and the imputed data after info score filtering.
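
The effect of the filter on the mean INFO score can be sketched as follows. This Python example uses made-up per-SNP INFO scores; only the cutoff of 0.3 comes from this report, and the names `mean_info` and `mean_info_cutoff` simply mirror the column names in the data summary.

```python
from statistics import mean

INFO_CUTOFF = 0.3  # threshold used throughout this report

# Hypothetical per-SNP INFO scores for one imputed array (illustrative values only)
info_scores = [0.98, 0.95, 0.91, 0.74, 0.52, 0.31, 0.28, 0.12]

# Keep only SNPs with info > 0.3, i.e. drop those with info <= 0.3
kept = [s for s in info_scores if s > INFO_CUTOFF]

mean_info = mean(info_scores)
mean_info_cutoff = mean(kept)

print(f"SNPs before filter: {len(info_scores)}, after filter: {len(kept)}")
print(f"mean INFO raw: {mean_info:.3f}, filtered: {mean_info_cutoff:.3f}")
```

Since every removed value is at or below the cutoff and every kept value is above it, the filtered mean can never be lower than the raw mean. The filtered results should therefore be read as describing the retained SNPs only.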

Boxplot of mean info score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data do not include a filter to remove imputed data points with info ≤ 0.3.

ANOVA table for the linear model testing for a significant difference in the IMPUTE2 INFO score across Reference Panel minor allele frequency filtering thresholds
term Df Sum.Sq Mean.Sq F.value Pr(>F)
refpan_maf 2 5.610955 2.8054777 132.0276 0
Residuals 304 6.459749 0.0212492 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean IMPUTE2 INFO score for different Reference Panel minor allele frequency filtering thresholds
refpan_maf emmean SE df lower.CL upper.CL
MAF1% 0.7392964 0.0145048 304 0.7107539 0.7678388
MAF0.5% 0.6412761 0.0143632 304 0.6130121 0.6695401
MAF0.1% 0.4162709 0.0143632 304 0.3880069 0.4445348
Table showing the pairwise contrasts for the linear model testing for a significant difference in the mean IMPUTE2 INFO score for different Reference Panel minor allele frequency filtering thresholds
contrast estimate SE df t.ratio p.value
MAF1% - MAF0.5% 0.0980203 0.0204130 304 4.801855 7.4e-06
MAF1% - MAF0.1% 0.3230255 0.0204130 304 15.824500 0.0e+00
MAF0.5% - MAF0.1% 0.2250052 0.0203127 304 11.077078 0.0e+00

Boxplot of mean info score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data include a filter to remove imputed data points with info ≤ 0.3.

ANOVA table for the linear model testing for a significant difference in the IMPUTE2 INFO score (following filtering to info > 0.3) across Reference Panel minor allele frequency filtering thresholds
term Df Sum.Sq Mean.Sq F.value Pr(>F)
refpan_maf 2 1.097597 0.5487984 224.6923 0
Residuals 304 0.742503 0.0024424 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean IMPUTE2 INFO score (following filtering to info > 0.3) for different Reference Panel minor allele frequency filtering thresholds
refpan_maf emmean SE df lower.CL upper.CL
MAF1% 0.8461870 0.0049176 304 0.8365101 0.8558638
MAF0.5% 0.7911693 0.0048696 304 0.7815869 0.8007517
MAF0.1% 0.7010144 0.0048696 304 0.6914320 0.7105968
Table showing the pairwise contrasts for the linear model testing for a significant difference in the mean IMPUTE2 INFO score (following filtering to info > 0.3) for different Reference Panel minor allele frequency filtering thresholds
contrast estimate SE df t.ratio p.value
MAF1% - MAF0.5% 0.0550177 0.0069207 304 7.94976 0
MAF1% - MAF0.1% 0.1451726 0.0069207 304 20.97667 0
MAF0.5% - MAF0.1% 0.0901549 0.0068867 304 13.09124 0

Number of included reference haplotypes (k_hap) experiments

This section details the experiments on the number of included reference haplotypes (k_hap).

## Rows: 1,161
## Columns: 71
## $ array                                           <fct> BDCHP-1X10-HUMANHAP24…
## $ mcmc                                            <chr> "kHAP100", "MCMC1", "…
## $ refpan_maf                                      <ord> MAF1%, MAF1%, MAF1%, …
## $ k_hap                                           <ord> kHAP100, kHAP100, kHA…
## $ imputed                                         <lgl> TRUE, FALSE, FALSE, F…
## $ info_cutoff                                     <dbl> 0.3, NA, NA, NA, NA, …
## $ n_snps_array                                    <dbl> 309, NA, NA, NA, NA, …
## $ n_snps_imputed                                  <dbl> 500, NA, NA, NA, NA, …
## $ n_snps_cutoff_imputed                           <dbl> 492, NA, NA, NA, NA, …
## $ n_type_0                                        <dbl> 198, NA, NA, NA, NA, …
## $ n_type_1                                        <dbl> 0, NA, NA, NA, NA, 0,…
## $ n_type_2                                        <dbl> 229, NA, NA, NA, NA, …
## $ n_type_3                                        <dbl> 73, NA, NA, NA, NA, 4…
## $ n_type_0_cutoff                                 <dbl> 190, NA, NA, NA, NA, …
## $ n_type_1_cutoff                                 <dbl> 0, NA, NA, NA, NA, 0,…
## $ n_type_2_cutoff                                 <dbl> 229, NA, NA, NA, NA, …
## $ n_type_3_cutoff                                 <dbl> 73, NA, NA, NA, NA, 4…
## $ mean_info                                       <dbl> 0.8949340, NA, NA, NA…
## $ mean_info_cutoff                                <dbl> 0.9074289, NA, NA, NA…
## $ mean_maf                                        <dbl> 0.06253400, NA, NA, N…
## $ mean_maf_cutoff                                 <dbl> 0.06353455, NA, NA, N…
## $ mean_mcc                                        <dbl> 0.8104934, NA, NA, NA…
## $ mean_mcc_cutoff                                 <dbl> 0.8462349, NA, NA, NA…
## $ mean_concordance                                <dbl> 0.9949293, NA, NA, NA…
## $ mean_concordance_cutoff                         <dbl> 0.9949061, NA, NA, NA…
## $ mean_certainty                                  <dbl> 0.9975062, NA, NA, NA…
## $ mean_certainty_cutoff                           <dbl> 0.9974895, NA, NA, NA…
## $ mean_himc_concordance_typed                     <dbl> 0.9806553, NA, NA, NA…
## $ mean_himc_concordance_typed_macro               <dbl> 0.9936834, NA, NA, NA…
## $ mean_himc_concordance_imputed                   <dbl> 0.9893407, NA, NA, NA…
## $ mean_himc_concordance_imputed_cutoff            <dbl> 0.9893407, NA, NA, NA…
## $ mean_himc_concordance_imputed_macro             <dbl> 1.0000000, NA, NA, NA…
## $ mean_himc_concordance_imputed_macro_cutoff      <dbl> 1.0000000, NA, NA, NA…
## $ mean_haplogrep_concordance_typed                <dbl> 0.3062352, NA, NA, NA…
## $ mean_haplogrep_concordance_typed_macro          <dbl> 0.9932912, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed              <dbl> 0.3026835, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_cutoff       <dbl> 0.3026835, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_macro        <dbl> 0.9952644, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_macro_cutoff <dbl> 0.9952644, NA, NA, NA…
## $ mean_haplogrep_quality_truth                    <dbl> 0.8560609, NA, NA, NA…
## $ mean_haplogrep_quality_typed                    <dbl> 0.9822484, NA, NA, NA…
## $ mean_haplogrep_quality_imputed                  <dbl> 0.9789768, NA, NA, NA…
## $ mean_haplogrep_quality_imputed_cutoff           <dbl> 0.9790359, NA, NA, NA…
## $ mean_haplogrep_distance_dl_typed                <dbl> 1.865430, NA, NA, NA,…
## $ mean_haplogrep_distance_dl_imputed              <dbl> 2.093528, NA, NA, NA,…
## $ mean_haplogrep_distance_dl_imputed_cutoff       <dbl> 2.094317, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_typed                <dbl> 1.865430, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_imputed              <dbl> 2.093528, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_imputed_cutoff       <dbl> 2.094317, NA, NA, NA,…
## $ mean_haplogrep_distance_jc_typed                <dbl> 0.2800019, NA, NA, NA…
## $ mean_haplogrep_distance_jc_imputed              <dbl> 0.2901451, NA, NA, NA…
## $ mean_haplogrep_distance_jc_imputed_cutoff       <dbl> 0.2901677, NA, NA, NA…
## $ himc_diff                                       <dbl> 0.008685353, NA, NA, …
## $ himc_cutoff_diff                                <dbl> 0.008685353, NA, NA, …
## $ himc_macro_diff                                 <dbl> 0.006316621, NA, NA, …
## $ himc_macro_cutoff_diff                          <dbl> 0.006316621, NA, NA, …
## $ haplogrep_diff                                  <dbl> -0.003551697, NA, NA,…
## $ haplogrep_cutoff_diff                           <dbl> -0.003551697, NA, NA,…
## $ haplogrep_macro_diff                            <dbl> 0.001973165, NA, NA, …
## $ haplogrep_macro_cutoff_diff                     <dbl> 0.001973165, NA, NA, …
## $ haplogrep_quality_diff                          <dbl> -0.003271665, NA, NA,…
## $ haplogrep_quality_cutoff_diff                   <dbl> -0.003212510, NA, NA,…
## $ haplogrep_quality_diff_truth_typed              <dbl> -0.1261875, NA, NA, N…
## $ haplogrep_quality_diff_truth_imputed            <dbl> -0.1229158, NA, NA, N…
## $ haplogrep_quality_diff_truth_imputed_cutoff     <dbl> -0.1229750, NA, NA, N…
## $ haplogrep_distance_dl_diff                      <dbl> 0.2280979, NA, NA, NA…
## $ haplogrep_distance_dl_cutoff_diff               <dbl> 0.2288871, NA, NA, NA…
## $ haplogrep_distance_lv_diff                      <dbl> 0.2280979, NA, NA, NA…
## $ haplogrep_distance_lv_cutoff_diff               <dbl> 0.2288871, NA, NA, NA…
## $ haplogrep_distance_jc_diff                      <dbl> 0.010143189, NA, NA, …
## $ haplogrep_distance_jc_cutoff_diff               <dbl> 0.0101657397, NA, NA,…

HiMC

HiMC Haplogrouping

We previously found that imputing missing variants increased the accuracy of haplogroup assignments when using HiMC to assign haplogroups.

Boxplot of mean haplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HiMC.

Compare this result with the imputed data, which shows a higher haplogroup concordance:
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HiMC. These data do not include a filter to remove imputed data points with info ≤ 0.3.

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of mean difference in haplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HiMC. The imputed data do not include a filter to remove imputed data points with info ≤ 0.3.
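
The quantities plotted above can be sketched as follows. In this Python example the labels are hypothetical, concordance is taken as the fraction of exact label matches against the truth set, and the macrohaplogroup is approximated by the leading letter of the label; this is a simplification and may not match how HiMC or HaploGrep define macrohaplogroups.

```python
def concordance(truth, called):
    """Fraction of samples whose called label exactly matches the truth label."""
    matches = sum(1 for t, c in zip(truth, called) if t == c)
    return matches / len(truth)


def macro(label):
    """Simplifying assumption: treat the leading letter as the macrohaplogroup."""
    return label[:1]


# Hypothetical labels for four samples (not taken from the report's data)
truth = ["H1a", "U5b", "J1c", "L3e"]
typed = ["H", "U5", "J1c", "L2"]
imputed = ["H1a", "U5b", "J1c", "L3"]

typed_conc = concordance(truth, typed)
imputed_conc = concordance(truth, imputed)
diff = imputed_conc - typed_conc  # positive when imputation improves assignment

macro_truth = [macro(x) for x in truth]
macro_typed_conc = concordance(macro_truth, [macro(x) for x in typed])
macro_imputed_conc = concordance(macro_truth, [macro(x) for x in imputed])

print(typed_conc, imputed_conc, diff, macro_typed_conc, macro_imputed_conc)
```

The sign convention (imputed minus genotyped) matches the data summary above, where `himc_diff` equals `mean_himc_concordance_imputed` minus `mean_himc_concordance_typed`.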

ANOVA table for the linear model testing for a significant difference in mean imputed haplogroup concordance across numbers of included reference haplotypes (k_hap)
term Df Sum.Sq Mean.Sq F.value Pr(>F)
k_hap 8 20.34980 2.5437249 92.33707 0
Residuals 900 24.79343 0.0275483 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in mean imputed haplogroup concordance for different numbers of included reference haplotypes (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.8671472 0.0165153 900 0.8347342 0.8995601
kHAP250 0.8643340 0.0165153 900 0.8319211 0.8967470
kHAP500 0.8601259 0.0165153 900 0.8277130 0.8925389
kHAP1000 0.8486516 0.0165153 900 0.8162386 0.8810645
kHAP2500 0.8119014 0.0165153 900 0.7794885 0.8443144
kHAP5000 0.7220557 0.0165153 900 0.6896427 0.7544687
kHAP10000 0.5948746 0.0165153 900 0.5624617 0.6272876
kHAP20000 0.5225614 0.0165153 900 0.4901484 0.5549744
kHAP30000 0.4739555 0.0165153 900 0.4415425 0.5063685
Table showing the pairwise contrasts for the linear model testing for a significant difference in mean imputed haplogroup concordance for different numbers of included reference haplotypes (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0028131 0.0233562 900 0.1204443 1.0000000
kHAP100 - kHAP500 0.0070212 0.0233562 900 0.3006153 0.9999981
kHAP100 - kHAP1000 0.0184956 0.0233562 900 0.7918939 0.9970862
kHAP100 - kHAP2500 0.0552457 0.0233562 900 2.3653606 0.3045080
kHAP100 - kHAP5000 0.1450915 0.0233562 900 6.2121295 0.0000000
kHAP100 - kHAP10000 0.2722725 0.0233562 900 11.6574208 0.0000000
kHAP100 - kHAP20000 0.3445857 0.0233562 900 14.7535307 0.0000000
kHAP100 - kHAP30000 0.3931917 0.0233562 900 16.8346063 0.0000000
kHAP250 - kHAP500 0.0042081 0.0233562 900 0.1801711 1.0000000
kHAP250 - kHAP1000 0.0156825 0.0233562 900 0.6714496 0.9991049
kHAP250 - kHAP2500 0.0524326 0.0233562 900 2.2449163 0.3774111
kHAP250 - kHAP5000 0.1422783 0.0233562 900 6.0916852 0.0000001
kHAP250 - kHAP10000 0.2694594 0.0233562 900 11.5369765 0.0000000
kHAP250 - kHAP20000 0.3417726 0.0233562 900 14.6330864 0.0000000
kHAP250 - kHAP30000 0.3903785 0.0233562 900 16.7141620 0.0000000
kHAP500 - kHAP1000 0.0114744 0.0233562 900 0.4912786 0.9999130
kHAP500 - kHAP2500 0.0482245 0.0233562 900 2.0647453 0.4983509
kHAP500 - kHAP5000 0.1380702 0.0233562 900 5.9115142 0.0000002
kHAP500 - kHAP10000 0.2652513 0.0233562 900 11.3568055 0.0000000
kHAP500 - kHAP20000 0.3375645 0.0233562 900 14.4529153 0.0000000
kHAP500 - kHAP30000 0.3861704 0.0233562 900 16.5339909 0.0000000
kHAP1000 - kHAP2500 0.0367501 0.0233562 900 1.5734667 0.8191720
kHAP1000 - kHAP5000 0.1265959 0.0233562 900 5.4202356 0.0000027
kHAP1000 - kHAP10000 0.2537769 0.0233562 900 10.8655269 0.0000000
kHAP1000 - kHAP20000 0.3260901 0.0233562 900 13.9616368 0.0000000
kHAP1000 - kHAP30000 0.3746961 0.0233562 900 16.0427124 0.0000000
kHAP2500 - kHAP5000 0.0898457 0.0233562 900 3.8467689 0.0040590
kHAP2500 - kHAP10000 0.2170268 0.0233562 900 9.2920602 0.0000000
kHAP2500 - kHAP20000 0.2893400 0.0233562 900 12.3881701 0.0000000
kHAP2500 - kHAP30000 0.3379459 0.0233562 900 14.4692457 0.0000000
kHAP5000 - kHAP10000 0.1271811 0.0233562 900 5.4452913 0.0000024
kHAP5000 - kHAP20000 0.1994943 0.0233562 900 8.5414012 0.0000000
kHAP5000 - kHAP30000 0.2481002 0.0233562 900 10.6224768 0.0000000
kHAP10000 - kHAP20000 0.0723132 0.0233562 900 3.0961099 0.0519824
kHAP10000 - kHAP30000 0.1209191 0.0233562 900 5.1771855 0.0000098
kHAP20000 - kHAP30000 0.0486059 0.0233562 900 2.0810756 0.4869961
ANOVA table for the linear model testing for a significant difference in the mean change in haplogroup concordance between genotyped and imputed data across numbers of included reference haplotypes (k_hap)
term Df Sum.Sq Mean.Sq F.value Pr(>F)
k_hap 8 20.34980 2.543725 159.3915 0
Residuals 900 14.36308 0.015959 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean change in haplogroup concordance between genotyped and imputed data for different numbers of included reference haplotypes (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.3173856 0.0125702 900 0.2927153 0.3420559
kHAP250 0.3145725 0.0125702 900 0.2899022 0.3392428
kHAP500 0.3103644 0.0125702 900 0.2856941 0.3350347
kHAP1000 0.2988900 0.0125702 900 0.2742197 0.3235603
kHAP2500 0.2621399 0.0125702 900 0.2374696 0.2868102
kHAP5000 0.1722942 0.0125702 900 0.1476239 0.1969645
kHAP10000 0.0451131 0.0125702 900 0.0204428 0.0697834
kHAP20000 -0.0272001 0.0125702 900 -0.0518704 -0.0025298
kHAP30000 -0.0758060 0.0125702 900 -0.1004763 -0.0511357
Table showing the pairwise contrasts for the linear model testing for a significant difference in the mean change in haplogroup concordance between genotyped and imputed data for different numbers of included reference haplotypes (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0028131 0.0177769 900 0.1582452 1.0000000
kHAP100 - kHAP500 0.0070212 0.0177769 900 0.3949623 0.9999837
kHAP100 - kHAP1000 0.0184956 0.0177769 900 1.0404267 0.9818242
kHAP100 - kHAP2500 0.0552457 0.0177769 900 3.1077199 0.0502388
kHAP100 - kHAP5000 0.1450915 0.0177769 900 8.1617823 0.0000000
kHAP100 - kHAP10000 0.2722725 0.0177769 900 15.3160573 0.0000000
kHAP100 - kHAP20000 0.3445857 0.0177769 900 19.3838693 0.0000000
kHAP100 - kHAP30000 0.3931917 0.0177769 900 22.1180825 0.0000000
kHAP250 - kHAP500 0.0042081 0.0177769 900 0.2367170 0.9999997
kHAP250 - kHAP1000 0.0156825 0.0177769 900 0.8821815 0.9938601
kHAP250 - kHAP2500 0.0524326 0.0177769 900 2.9494747 0.0787565
kHAP250 - kHAP5000 0.1422783 0.0177769 900 8.0035371 0.0000000
kHAP250 - kHAP10000 0.2694594 0.0177769 900 15.1578121 0.0000000
kHAP250 - kHAP20000 0.3417726 0.0177769 900 19.2256241 0.0000000
kHAP250 - kHAP30000 0.3903785 0.0177769 900 21.9598372 0.0000000
kHAP500 - kHAP1000 0.0114744 0.0177769 900 0.6454645 0.9993292
kHAP500 - kHAP2500 0.0482245 0.0177769 900 2.7127576 0.1446827
kHAP500 - kHAP5000 0.1380702 0.0177769 900 7.7668201 0.0000000
kHAP500 - kHAP10000 0.2652513 0.0177769 900 14.9210950 0.0000000
kHAP500 - kHAP20000 0.3375645 0.0177769 900 18.9889071 0.0000000
kHAP500 - kHAP30000 0.3861704 0.0177769 900 21.7231202 0.0000000
kHAP1000 - kHAP2500 0.0367501 0.0177769 900 2.0672932 0.4965760
kHAP1000 - kHAP5000 0.1265959 0.0177769 900 7.1213556 0.0000000
kHAP1000 - kHAP10000 0.2537769 0.0177769 900 14.2756306 0.0000000
kHAP1000 - kHAP20000 0.3260901 0.0177769 900 18.3434426 0.0000000
kHAP1000 - kHAP30000 0.3746961 0.0177769 900 21.0776557 0.0000000
kHAP2500 - kHAP5000 0.0898457 0.0177769 900 5.0540624 0.0000185
kHAP2500 - kHAP10000 0.2170268 0.0177769 900 12.2083374 0.0000000
kHAP2500 - kHAP20000 0.2893400 0.0177769 900 16.2761494 0.0000000
kHAP2500 - kHAP30000 0.3379459 0.0177769 900 19.0103626 0.0000000
kHAP5000 - kHAP10000 0.1271811 0.0177769 900 7.1542750 0.0000000
kHAP5000 - kHAP20000 0.1994943 0.0177769 900 11.2220870 0.0000000
kHAP5000 - kHAP30000 0.2481002 0.0177769 900 13.9563001 0.0000000
kHAP10000 - kHAP20000 0.0723132 0.0177769 900 4.0678120 0.0016920
kHAP10000 - kHAP30000 0.1209191 0.0177769 900 6.8020252 0.0000000
kHAP20000 - kHAP30000 0.0486059 0.0177769 900 2.7342131 0.1373727
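
The two sets of HiMC tables above share the same contrast estimates and the same k_hap sum of squares. The Python sketch below, using the estimated marginal means copied from those tables, shows that the two models differ by a constant offset at every k_hap level. This is consistent with the genotyped concordance not varying with k_hap, although that interpretation is an inference from the numbers rather than something stated in the source.

```python
# Estimated marginal means copied from the two HiMC haplogroup tables above
imputed_concordance = {
    "kHAP100": 0.8671472, "kHAP250": 0.8643340, "kHAP500": 0.8601259,
    "kHAP1000": 0.8486516, "kHAP2500": 0.8119014, "kHAP5000": 0.7220557,
    "kHAP10000": 0.5948746, "kHAP20000": 0.5225614, "kHAP30000": 0.4739555,
}
concordance_diff = {
    "kHAP100": 0.3173856, "kHAP250": 0.3145725, "kHAP500": 0.3103644,
    "kHAP1000": 0.2988900, "kHAP2500": 0.2621399, "kHAP5000": 0.1722942,
    "kHAP10000": 0.0451131, "kHAP20000": -0.0272001, "kHAP30000": -0.0758060,
}

# Offset between the two models at each k_hap level
offsets = {k: imputed_concordance[k] - concordance_diff[k] for k in imputed_concordance}
spread = max(offsets.values()) - min(offsets.values())
mean_offset = sum(offsets.values()) / len(offsets)

print(f"mean offset: {mean_offset:.5f}, spread across k_hap: {spread:.1e}")
```

Because the offset is constant up to rounding, the pairwise differences between k_hap levels are identical in both models; only the residual variance, and therefore the standard errors and p-values, differ.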

HiMC Macrohaplogrouping

The same trend can be seen when only macrohaplogroups are considered:
Boxplot of mean macrohaplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HiMC.

Compare this result with the imputed data, which shows a higher macrohaplogroup concordance:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HiMC. These data do not include a filter to remove imputed data points with info ≤ 0.3.

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of mean difference in macrohaplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HiMC. The imputed data do not include a filter to remove imputed data points with info ≤ 0.3.

These can be statistically tested with linear models:

ANOVA table for the linear model testing for a significant difference in mean imputed macrohaplogroup concordance across numbers of included reference haplotypes (k_hap)
term Df Sum.Sq Mean.Sq F.value Pr(>F)
k_hap 8 10.38166 1.2977077 46.20055 0
Residuals 900 25.27972 0.0280886 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in mean imputed macrohaplogroup concordance for different numbers of included reference haplotypes (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.8992230 0.0166765 900 0.8664937 0.9319523
kHAP250 0.8966250 0.0166765 900 0.8638957 0.9293543
kHAP500 0.8937972 0.0166765 900 0.8610679 0.9265265
kHAP1000 0.8863748 0.0166765 900 0.8536455 0.9191041
kHAP2500 0.8616677 0.0166765 900 0.8289384 0.8943970
kHAP5000 0.8112020 0.0166765 900 0.7784727 0.8439313
kHAP10000 0.7110193 0.0166765 900 0.6782900 0.7437486
kHAP20000 0.6477076 0.0166765 900 0.6149783 0.6804369
kHAP30000 0.6200152 0.0166765 900 0.5872859 0.6527445
Table showing the pairwise contrasts for the linear model testing for a significant difference in mean imputed macrohaplogroup concordance for different numbers of included reference haplotypes (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0025980 0.0235841 900 0.1101600 1.0000000
kHAP100 - kHAP500 0.0054258 0.0235841 900 0.2300617 0.9999998
kHAP100 - kHAP1000 0.0128482 0.0235841 900 0.5447840 0.9998098
kHAP100 - kHAP2500 0.0375553 0.0235841 900 1.5924014 0.8090568
kHAP100 - kHAP5000 0.0880210 0.0235841 900 3.7322197 0.0062535
kHAP100 - kHAP10000 0.1882037 0.0235841 900 7.9801129 0.0000000
kHAP100 - kHAP20000 0.2515154 0.0235841 900 10.6646214 0.0000000
kHAP100 - kHAP30000 0.2792078 0.0235841 900 11.8388188 0.0000000
kHAP250 - kHAP500 0.0028278 0.0235841 900 0.1199016 1.0000000
kHAP250 - kHAP1000 0.0102502 0.0235841 900 0.4346239 0.9999659
kHAP250 - kHAP2500 0.0349573 0.0235841 900 1.4822414 0.8638095
kHAP250 - kHAP5000 0.0854230 0.0235841 900 3.6220597 0.0093438
kHAP250 - kHAP10000 0.1856057 0.0235841 900 7.8699529 0.0000000
kHAP250 - kHAP20000 0.2489174 0.0235841 900 10.5544613 0.0000000
kHAP250 - kHAP30000 0.2766098 0.0235841 900 11.7286587 0.0000000
kHAP500 - kHAP1000 0.0074224 0.0235841 900 0.3147223 0.9999972
kHAP500 - kHAP2500 0.0321295 0.0235841 900 1.3623398 0.9115484
kHAP500 - kHAP5000 0.0825952 0.0235841 900 3.5021580 0.0142352
kHAP500 - kHAP10000 0.1827779 0.0235841 900 7.7500513 0.0000000
kHAP500 - kHAP20000 0.2460896 0.0235841 900 10.4345597 0.0000000
kHAP500 - kHAP30000 0.2737820 0.0235841 900 11.6087571 0.0000000
kHAP1000 - kHAP2500 0.0247071 0.0235841 900 1.0476175 0.9810132
kHAP1000 - kHAP5000 0.0751728 0.0235841 900 3.1874357 0.0395584
kHAP1000 - kHAP10000 0.1753555 0.0235841 900 7.4353290 0.0000000
kHAP1000 - kHAP20000 0.2386672 0.0235841 900 10.1198374 0.0000000
kHAP1000 - kHAP30000 0.2663596 0.0235841 900 11.2940348 0.0000000
kHAP2500 - kHAP5000 0.0504657 0.0235841 900 2.1398182 0.4466629
kHAP2500 - kHAP10000 0.1506484 0.0235841 900 6.3877115 0.0000000
kHAP2500 - kHAP20000 0.2139601 0.0235841 900 9.0722199 0.0000000
kHAP2500 - kHAP30000 0.2416525 0.0235841 900 10.2464173 0.0000000
kHAP5000 - kHAP10000 0.1001827 0.0235841 900 4.2478933 0.0007980
kHAP5000 - kHAP20000 0.1634944 0.0235841 900 6.9324017 0.0000000
kHAP5000 - kHAP30000 0.1911868 0.0235841 900 8.1065991 0.0000000
kHAP10000 - kHAP20000 0.0633117 0.0235841 900 2.6845084 0.1547462
kHAP10000 - kHAP30000 0.0910041 0.0235841 900 3.8587058 0.0038769
kHAP20000 - kHAP30000 0.0276924 0.0235841 900 1.1741974 0.9617539
Table showing the residuals for the linear model testing for significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 10.38166 1.2977077 57.74605 0
Residuals 900 20.22540 0.0224727 NA NA
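
As a quick sanity check on how the columns of the table above relate, the following minimal Python sketch recomputes the mean squares and the F value from the reported sums of squares and degrees of freedom. It is an independent arithmetic check written for illustration, not the code used to generate this report; the variable names are assumptions.

```python
# Values copied from the ANOVA table above (k_hap and Residuals rows).
ss_k_hap, df_k_hap = 10.38166, 8
ss_resid, df_resid = 20.22540, 900

# A mean square is a sum of squares divided by its degrees of freedom.
ms_k_hap = ss_k_hap / df_k_hap
ms_resid = ss_resid / df_resid

# The F statistic is the ratio of the effect mean square to the residual mean square.
f_value = ms_k_hap / ms_resid

print(f"Mean Sq (k_hap)     = {ms_k_hap:.7f}")
print(f"Mean Sq (Residuals) = {ms_resid:.7f}")
print(f"F value             = {f_value:.3f}")
```

The recomputed values agree with the table up to rounding of the printed sums of squares.
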
Table showing the estimated marginal means for the linear model testing for significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.2417318 0.0149165 900 0.2124566 0.2710069
kHAP250 0.2391337 0.0149165 900 0.2098586 0.2684089
kHAP500 0.2363060 0.0149165 900 0.2070308 0.2655811
kHAP1000 0.2288835 0.0149165 900 0.1996084 0.2581587
kHAP2500 0.2041764 0.0149165 900 0.1749013 0.2334516
kHAP5000 0.1537108 0.0149165 900 0.1244356 0.1829859
kHAP10000 0.0535281 0.0149165 900 0.0242529 0.0828032
kHAP20000 -0.0097836 0.0149165 900 -0.0390588 0.0194915
kHAP30000 -0.0374760 0.0149165 900 -0.0667512 -0.0082009
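
The confidence limits in the table above appear to be symmetric intervals of the form emmean ± multiplier × SE. The sketch below, in Python with only the standard library, backs the multiplier out of the kHAP100 row and compares it with the normal 97.5% quantile. That the intervals are 95% t-based intervals on 900 degrees of freedom is an assumption inferred from the numbers, not something stated in the report.

```python
from statistics import NormalDist

# kHAP100 row of the estimated marginal means table above.
emmean, se = 0.2417318, 0.0149165
lower_cl, upper_cl = 0.2124566, 0.2710069

# Half-width of the interval and the implied multiplier on the standard error.
half_width = (upper_cl - lower_cl) / 2
multiplier = half_width / se

# Normal 97.5% quantile for comparison; a t quantile with 900 df is slightly larger.
z_975 = NormalDist().inv_cdf(0.975)

print(f"implied multiplier = {multiplier:.4f}")
print(f"normal quantile    = {z_975:.4f}")
```

The implied multiplier sits just above the normal quantile, which is what a t-based 95% interval with a large number of degrees of freedom would give.
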
Table showing the contrasts for the linear model testing for significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0025980 0.0210951 900 0.1231577 1.0000000
kHAP100 - kHAP500 0.0054258 0.0210951 900 0.2572064 0.9999994
kHAP100 - kHAP1000 0.0128482 0.0210951 900 0.6090625 0.9995627
kHAP100 - kHAP2500 0.0375553 0.0210951 900 1.7802873 0.6953597
kHAP100 - kHAP5000 0.0880210 0.0210951 900 4.1725807 0.0010972
kHAP100 - kHAP10000 0.1882037 0.0210951 900 8.9216787 0.0000000
kHAP100 - kHAP20000 0.2515154 0.0210951 900 11.9229297 0.0000000
kHAP100 - kHAP30000 0.2792078 0.0210951 900 13.2356695 0.0000000
kHAP250 - kHAP500 0.0028278 0.0210951 900 0.1340487 1.0000000
kHAP250 - kHAP1000 0.0102502 0.0210951 900 0.4859048 0.9999200
kHAP250 - kHAP2500 0.0349573 0.0210951 900 1.6571296 0.7724479
kHAP250 - kHAP5000 0.0854230 0.0210951 900 4.0494230 0.0018234
kHAP250 - kHAP10000 0.1856057 0.0210951 900 8.7985210 0.0000000
kHAP250 - kHAP20000 0.2489174 0.0210951 900 11.7997720 0.0000000
kHAP250 - kHAP30000 0.2766098 0.0210951 900 13.1125118 0.0000000
kHAP500 - kHAP1000 0.0074224 0.0210951 900 0.3518561 0.9999934
kHAP500 - kHAP2500 0.0321295 0.0210951 900 1.5230809 0.8446867
kHAP500 - kHAP5000 0.0825952 0.0210951 900 3.9153742 0.0031114
kHAP500 - kHAP10000 0.1827779 0.0210951 900 8.6644723 0.0000000
kHAP500 - kHAP20000 0.2460896 0.0210951 900 11.6657233 0.0000000
kHAP500 - kHAP30000 0.2737820 0.0210951 900 12.9784631 0.0000000
kHAP1000 - kHAP2500 0.0247071 0.0210951 900 1.1712248 0.9623259
kHAP1000 - kHAP5000 0.0751728 0.0210951 900 3.5635182 0.0115004
kHAP1000 - kHAP10000 0.1753555 0.0210951 900 8.3126162 0.0000000
kHAP1000 - kHAP20000 0.2386672 0.0210951 900 11.3138672 0.0000000
kHAP1000 - kHAP30000 0.2663596 0.0210951 900 12.6266070 0.0000000
kHAP2500 - kHAP5000 0.0504657 0.0210951 900 2.3922933 0.2893352
kHAP2500 - kHAP10000 0.1506484 0.0210951 900 7.1413914 0.0000000
kHAP2500 - kHAP20000 0.2139601 0.0210951 900 10.1426424 0.0000000
kHAP2500 - kHAP30000 0.2416525 0.0210951 900 11.4553822 0.0000000
kHAP5000 - kHAP10000 0.1001827 0.0210951 900 4.7490981 0.0000828
kHAP5000 - kHAP20000 0.1634944 0.0210951 900 7.7503490 0.0000000
kHAP5000 - kHAP30000 0.1911868 0.0210951 900 9.0630889 0.0000000
kHAP10000 - kHAP20000 0.0633117 0.0210951 900 3.0012510 0.0682356
kHAP10000 - kHAP30000 0.0910041 0.0210951 900 4.3139908 0.0006005
kHAP20000 - kHAP30000 0.0276924 0.0210951 900 1.3127398 0.9276140
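
To make the link between the estimated marginal means and the contrasts explicit, here is a small Python sketch that rebuilds one row of the contrast table (kHAP100 - kHAP5000) from the marginal means table. It assumes a balanced design with a pooled residual variance, in which case the standard error of a pairwise difference is sqrt(2) times the standard error of a single mean; that assumption is consistent with the reported numbers but is not stated in the report.

```python
import math

# Estimated marginal means and their shared standard error, from the table above.
emmean_khap100 = 0.2417318
emmean_khap5000 = 0.1537108
se_emmean = 0.0149165

# A pairwise contrast is the difference between two marginal means.
estimate = emmean_khap100 - emmean_khap5000

# With equal group sizes and a pooled variance, SE(difference) = sqrt(2) * SE(mean).
se_contrast = se_emmean * math.sqrt(2)

# The t ratio is the estimate divided by its standard error.
t_ratio = estimate / se_contrast

print(f"estimate = {estimate:.7f}")
print(f"SE       = {se_contrast:.7f}")
print(f"t.ratio  = {t_ratio:.4f}")
```

The three values match the kHAP100 - kHAP5000 row of the contrast table up to rounding.
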

These results suggest that there is no statistically significant difference in accurate assignment of haplogroups or macrohaplogroups between different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds.

HaploGrep 2.0

HaploGrep Haplogrouping

We are investigating HaploGrep 2.0 for haplogroup assignment because it can assign a wider range of haplogroups, including sub-groupings, than HiMC.

Boxplot of mean haplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep


Compare this result with the imputed data, which shows a higher haplogroup concordance:
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data are not filtered to remove imputed sites with info ≤ 0.3.

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of mean difference in haplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. The imputed data are not filtered to remove imputed sites with info ≤ 0.3.

Table showing the residuals for the linear model testing for significant difference in the means of imputed haplogroup concordance
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 1.451465 0.1814331 22.2032 0
Residuals 900 7.354335 0.0081715 NA NA
Table showing the estimated marginal means for the linear model testing for significant difference in the means of imputed haplogroup concordance for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.1848797 0.0089948 900 0.1672265 0.2025328
kHAP250 0.1800738 0.0089948 900 0.1624206 0.1977269
kHAP500 0.1673830 0.0089948 900 0.1497298 0.1850362
kHAP1000 0.1517032 0.0089948 900 0.1340500 0.1693563
kHAP2500 0.1267280 0.0089948 900 0.1090748 0.1443811
kHAP5000 0.1056522 0.0089948 900 0.0879991 0.1233054
kHAP10000 0.0909062 0.0089948 900 0.0732531 0.1085594
kHAP20000 0.0830683 0.0089948 900 0.0654151 0.1007214
kHAP30000 0.0785320 0.0089948 900 0.0608788 0.0961851
Table showing the contrasts for the linear model testing for significant difference in the means of imputed haplogroup concordance for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0048059 0.0127205 900 0.3778091 0.9999885
kHAP100 - kHAP500 0.0174967 0.0127205 900 1.3754707 0.9069373
kHAP100 - kHAP1000 0.0331765 0.0127205 900 2.6081111 0.1845241
kHAP100 - kHAP2500 0.0581517 0.0127205 900 4.5714895 0.0001901
kHAP100 - kHAP5000 0.0792275 0.0127205 900 6.2283204 0.0000000
kHAP100 - kHAP10000 0.0939734 0.0127205 900 7.3875492 0.0000000
kHAP100 - kHAP20000 0.1018114 0.0127205 900 8.0037158 0.0000000
kHAP100 - kHAP30000 0.1063477 0.0127205 900 8.3603307 0.0000000
kHAP250 - kHAP500 0.0126908 0.0127205 900 0.9976616 0.9861121
kHAP250 - kHAP1000 0.0283706 0.0127205 900 2.2303021 0.3867648
kHAP250 - kHAP2500 0.0533458 0.0127205 900 4.1936805 0.0010042
kHAP250 - kHAP5000 0.0744215 0.0127205 900 5.8505114 0.0000002
kHAP250 - kHAP10000 0.0891675 0.0127205 900 7.0097401 0.0000000
kHAP250 - kHAP20000 0.0970055 0.0127205 900 7.6259068 0.0000000
kHAP250 - kHAP30000 0.1015418 0.0127205 900 7.9825217 0.0000000
kHAP500 - kHAP1000 0.0156798 0.0127205 900 1.2326404 0.9491916
kHAP500 - kHAP2500 0.0406550 0.0127205 900 3.1960188 0.0385344
kHAP500 - kHAP5000 0.0617308 0.0127205 900 4.8528498 0.0000502
kHAP500 - kHAP10000 0.0764767 0.0127205 900 6.0120785 0.0000001
kHAP500 - kHAP20000 0.0843147 0.0127205 900 6.6282451 0.0000000
kHAP500 - kHAP30000 0.0888510 0.0127205 900 6.9848600 0.0000000
kHAP1000 - kHAP2500 0.0249752 0.0127205 900 1.9633784 0.5695169
kHAP1000 - kHAP5000 0.0460509 0.0127205 900 3.6202093 0.0094059
kHAP1000 - kHAP10000 0.0607969 0.0127205 900 4.7794381 0.0000716
kHAP1000 - kHAP20000 0.0686349 0.0127205 900 5.3956047 0.0000031
kHAP1000 - kHAP30000 0.0731712 0.0127205 900 5.7522196 0.0000004
kHAP2500 - kHAP5000 0.0210757 0.0127205 900 1.6568309 0.7726237
kHAP2500 - kHAP10000 0.0358217 0.0127205 900 2.8160597 0.1120397
kHAP2500 - kHAP20000 0.0436597 0.0127205 900 3.4322263 0.0180539
kHAP2500 - kHAP30000 0.0481960 0.0127205 900 3.7888412 0.0050599
kHAP5000 - kHAP10000 0.0147460 0.0127205 900 1.1592287 0.9645710
kHAP5000 - kHAP20000 0.0225839 0.0127205 900 1.7753954 0.6985758
kHAP5000 - kHAP30000 0.0271203 0.0127205 900 2.1320103 0.4519694
kHAP10000 - kHAP20000 0.0078380 0.0127205 900 0.6161666 0.9995235
kHAP10000 - kHAP30000 0.0123743 0.0127205 900 0.9727815 0.9882167
kHAP20000 - kHAP30000 0.0045363 0.0127205 900 0.3566149 0.9999926
Table showing the residuals for the linear model testing for significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 1.451465 0.1814331 153.1917 0
Residuals 900 1.065918 0.0011844 NA NA
Table showing the estimated marginal means for the linear model testing for significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
k_hap emmean SE df lower.CL upper.CL
kHAP100 -0.0219979 0.0034244 900 -0.0287185 -0.0152772
kHAP250 -0.0268038 0.0034244 900 -0.0335245 -0.0200831
kHAP500 -0.0394946 0.0034244 900 -0.0462152 -0.0327739
kHAP1000 -0.0551744 0.0034244 900 -0.0618951 -0.0484537
kHAP2500 -0.0801496 0.0034244 900 -0.0868702 -0.0734289
kHAP5000 -0.1012253 0.0034244 900 -0.1079460 -0.0945046
kHAP10000 -0.1159713 0.0034244 900 -0.1226920 -0.1092506
kHAP20000 -0.1238093 0.0034244 900 -0.1305299 -0.1170886
kHAP30000 -0.1283456 0.0034244 900 -0.1350663 -0.1216249
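
The two sets of HaploGrep haplogroup marginal means (imputed concordance, and the genotyped versus imputed difference) can be cross-checked against each other. The Python sketch below subtracts each difference from the corresponding imputed mean; if the difference is defined as imputed minus genotyped, the result should be the mean genotyped concordance, which cannot depend on k_hap. That definition of the difference is an assumption inferred from the tables, and the dictionary names are illustrative.

```python
# Estimated marginal means of imputed haplogroup concordance, by k_hap.
imputed = {
    "kHAP100": 0.1848797, "kHAP250": 0.1800738, "kHAP500": 0.1673830,
    "kHAP1000": 0.1517032, "kHAP2500": 0.1267280, "kHAP5000": 0.1056522,
    "kHAP10000": 0.0909062, "kHAP20000": 0.0830683, "kHAP30000": 0.0785320,
}

# Estimated marginal means of the genotyped versus imputed difference, by k_hap.
difference = {
    "kHAP100": -0.0219979, "kHAP250": -0.0268038, "kHAP500": -0.0394946,
    "kHAP1000": -0.0551744, "kHAP2500": -0.0801496, "kHAP5000": -0.1012253,
    "kHAP10000": -0.1159713, "kHAP20000": -0.1238093, "kHAP30000": -0.1283456,
}

# Assumed relationship: difference = imputed - genotyped, so genotyped = imputed - difference.
implied_genotyped = {k: imputed[k] - difference[k] for k in imputed}

# The implied genotyped mean should be the same for every k_hap level.
spread = max(implied_genotyped.values()) - min(implied_genotyped.values())

for k_hap, value in implied_genotyped.items():
    print(f"{k_hap:>9}: {value:.7f}")
print(f"spread across k_hap levels: {spread:.1e}")
```

The implied value is constant across all nine levels up to rounding, which also explains why the two contrast tables report identical estimates and differ only in their standard errors.
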
Table showing the contrasts for the linear model testing for significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0048059 0.0048428 900 0.9923895 0.9865808
kHAP100 - kHAP500 0.0174967 0.0048428 900 3.6129431 0.0096534
kHAP100 - kHAP1000 0.0331765 0.0048428 900 6.8507146 0.0000000
kHAP100 - kHAP2500 0.0581517 0.0048428 900 12.0079126 0.0000000
kHAP100 - kHAP5000 0.0792275 0.0048428 900 16.3599034 0.0000000
kHAP100 - kHAP10000 0.0939734 0.0048428 900 19.4048448 0.0000000
kHAP100 - kHAP20000 0.1018114 0.0048428 900 21.0233271 0.0000000
kHAP100 - kHAP30000 0.1063477 0.0048428 900 21.9600460 0.0000000
kHAP250 - kHAP500 0.0126908 0.0048428 900 2.6205537 0.1794155
kHAP250 - kHAP1000 0.0283706 0.0048428 900 5.8583251 0.0000002
kHAP250 - kHAP2500 0.0533458 0.0048428 900 11.0155231 0.0000000
kHAP250 - kHAP5000 0.0744215 0.0048428 900 15.3675140 0.0000000
kHAP250 - kHAP10000 0.0891675 0.0048428 900 18.4124553 0.0000000
kHAP250 - kHAP20000 0.0970055 0.0048428 900 20.0309377 0.0000000
kHAP250 - kHAP30000 0.1015418 0.0048428 900 20.9676565 0.0000000
kHAP500 - kHAP1000 0.0156798 0.0048428 900 3.2377715 0.0338735
kHAP500 - kHAP2500 0.0406550 0.0048428 900 8.3949694 0.0000000
kHAP500 - kHAP5000 0.0617308 0.0048428 900 12.7469603 0.0000000
kHAP500 - kHAP10000 0.0764767 0.0048428 900 15.7919017 0.0000000
kHAP500 - kHAP20000 0.0843147 0.0048428 900 17.4103840 0.0000000
kHAP500 - kHAP30000 0.0888510 0.0048428 900 18.3471028 0.0000000
kHAP1000 - kHAP2500 0.0249752 0.0048428 900 5.1571979 0.0000109
kHAP1000 - kHAP5000 0.0460509 0.0048428 900 9.5091888 0.0000000
kHAP1000 - kHAP10000 0.0607969 0.0048428 900 12.5541302 0.0000000
kHAP1000 - kHAP20000 0.0686349 0.0048428 900 14.1726125 0.0000000
kHAP1000 - kHAP30000 0.0731712 0.0048428 900 15.1093313 0.0000000
kHAP2500 - kHAP5000 0.0210757 0.0048428 900 4.3519909 0.0005089
kHAP2500 - kHAP10000 0.0358217 0.0048428 900 7.3969322 0.0000000
kHAP2500 - kHAP20000 0.0436597 0.0048428 900 9.0154146 0.0000000
kHAP2500 - kHAP30000 0.0481960 0.0048428 900 9.9521334 0.0000000
kHAP5000 - kHAP10000 0.0147460 0.0048428 900 3.0449413 0.0602891
kHAP5000 - kHAP20000 0.0225839 0.0048428 900 4.6634237 0.0001241
kHAP5000 - kHAP30000 0.0271203 0.0048428 900 5.6001425 0.0000010
kHAP10000 - kHAP20000 0.0078380 0.0048428 900 1.6184823 0.7946737
kHAP10000 - kHAP30000 0.0123743 0.0048428 900 2.5552012 0.2073877
kHAP20000 - kHAP30000 0.0045363 0.0048428 900 0.9367188 0.9908144

HaploGrep Macrohaplogrouping

The same trend can be seen when only macrohaplogroups are considered:
Boxplot of mean macrohaplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep


Compare this result with the imputed data, which shows a higher macrohaplogroup concordance:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data are not filtered to remove imputed sites with info ≤ 0.3.

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of mean difference in macrohaplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. The imputed data are not filtered to remove imputed sites with info ≤ 0.3.

These can be statistically tested with linear models:

Table showing the residuals for the linear model testing for significant difference in the means of imputed macrohaplogroup concordance
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 9.947415 1.2434269 41.6241 0
Residuals 900 26.885486 0.0298728 NA NA
Table showing the estimated marginal means for the linear model testing for significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.8890144 0.017198 900 0.8552616 0.9227671
kHAP250 0.8868615 0.017198 900 0.8531087 0.9206142
kHAP500 0.8840795 0.017198 900 0.8503267 0.9178323
kHAP1000 0.8789454 0.017198 900 0.8451926 0.9126981
kHAP2500 0.8538139 0.017198 900 0.8200611 0.8875667
kHAP5000 0.8002649 0.017198 900 0.7665121 0.8340177
kHAP10000 0.7017161 0.017198 900 0.6679633 0.7354688
kHAP20000 0.6522580 0.017198 900 0.6185052 0.6860108
kHAP30000 0.6115131 0.017198 900 0.5777603 0.6452659
Table showing the contrasts for the linear model testing for significant difference in the means of imputed macrohaplogroup concordance for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0021529 0.0243216 900 0.0885180 1.0000000
kHAP100 - kHAP500 0.0049349 0.0243216 900 0.2029006 0.9999999
kHAP100 - kHAP1000 0.0100690 0.0243216 900 0.4139944 0.9999766
kHAP100 - kHAP2500 0.0352005 0.0243216 900 1.4472935 0.8790296
kHAP100 - kHAP5000 0.0887494 0.0243216 900 3.6489981 0.0084808
kHAP100 - kHAP10000 0.1872983 0.0243216 900 7.7009056 0.0000000
kHAP100 - kHAP20000 0.2367564 0.0243216 900 9.7344100 0.0000000
kHAP100 - kHAP30000 0.2775012 0.0243216 900 11.4096654 0.0000000
kHAP250 - kHAP500 0.0027820 0.0243216 900 0.1143826 1.0000000
kHAP250 - kHAP1000 0.0079161 0.0243216 900 0.3254764 0.9999964
kHAP250 - kHAP2500 0.0330476 0.0243216 900 1.3587754 0.9127740
kHAP250 - kHAP5000 0.0865965 0.0243216 900 3.5604801 0.0116238
kHAP250 - kHAP10000 0.1851454 0.0243216 900 7.6123876 0.0000000
kHAP250 - kHAP20000 0.2346035 0.0243216 900 9.6458920 0.0000000
kHAP250 - kHAP30000 0.2753483 0.0243216 900 11.3211474 0.0000000
kHAP500 - kHAP1000 0.0051341 0.0243216 900 0.2110938 0.9999999
kHAP500 - kHAP2500 0.0302656 0.0243216 900 1.2443928 0.9463509
kHAP500 - kHAP5000 0.0838146 0.0243216 900 3.4460974 0.0172307
kHAP500 - kHAP10000 0.1823634 0.0243216 900 7.4980050 0.0000000
kHAP500 - kHAP20000 0.2318215 0.0243216 900 9.5315094 0.0000000
kHAP500 - kHAP30000 0.2725664 0.0243216 900 11.2067648 0.0000000
kHAP1000 - kHAP2500 0.0251315 0.0243216 900 1.0332991 0.9826016
kHAP1000 - kHAP5000 0.0786804 0.0243216 900 3.2350037 0.0341666
kHAP1000 - kHAP10000 0.1772293 0.0243216 900 7.2869113 0.0000000
kHAP1000 - kHAP20000 0.2266873 0.0243216 900 9.3204156 0.0000000
kHAP1000 - kHAP30000 0.2674322 0.0243216 900 10.9956710 0.0000000
kHAP2500 - kHAP5000 0.0535490 0.0243216 900 2.2017046 0.4053450
kHAP2500 - kHAP10000 0.1520978 0.0243216 900 6.2536122 0.0000000
kHAP2500 - kHAP20000 0.2015559 0.0243216 900 8.2871165 0.0000000
kHAP2500 - kHAP30000 0.2423007 0.0243216 900 9.9623719 0.0000000
kHAP5000 - kHAP10000 0.0985488 0.0243216 900 4.0519076 0.0018051
kHAP5000 - kHAP20000 0.1480069 0.0243216 900 6.0854119 0.0000001
kHAP5000 - kHAP30000 0.1887518 0.0243216 900 7.7606673 0.0000000
kHAP10000 - kHAP20000 0.0494581 0.0243216 900 2.0335043 0.5201945
kHAP10000 - kHAP30000 0.0902029 0.0243216 900 3.7087598 0.0068198
kHAP20000 - kHAP30000 0.0407449 0.0243216 900 1.6752554 0.7616696
Table showing the residuals for the linear model testing for significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 9.947415 1.2434269 318.3801 0
Residuals 900 3.514932 0.0039055 NA NA
Table showing the estimated marginal means for the linear model testing for significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.0068846 0.0062184 900 -0.0053196 0.0190888
kHAP250 0.0047317 0.0062184 900 -0.0074725 0.0169359
kHAP500 0.0019497 0.0062184 900 -0.0102545 0.0141539
kHAP1000 -0.0031844 0.0062184 900 -0.0153886 0.0090198
kHAP2500 -0.0283159 0.0062184 900 -0.0405201 -0.0161117
kHAP5000 -0.0818649 0.0062184 900 -0.0940690 -0.0696607
kHAP10000 -0.1804137 0.0062184 900 -0.1926179 -0.1682095
kHAP20000 -0.2298718 0.0062184 900 -0.2420760 -0.2176676
kHAP30000 -0.2706166 0.0062184 900 -0.2828208 -0.2584125
Table showing the contrasts for the linear model testing for significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0021529 0.0087941 900 0.2448117 0.9999996
kHAP100 - kHAP500 0.0049349 0.0087941 900 0.5611563 0.9997625
kHAP100 - kHAP1000 0.0100690 0.0087941 900 1.1449722 0.9671100
kHAP100 - kHAP2500 0.0352005 0.0087941 900 4.0027374 0.0022012
kHAP100 - kHAP5000 0.0887494 0.0087941 900 10.0919279 0.0000000
kHAP100 - kHAP10000 0.1872983 0.0087941 900 21.2981709 0.0000000
kHAP100 - kHAP20000 0.2367564 0.0087941 900 26.9221748 0.0000000
kHAP100 - kHAP30000 0.2775012 0.0087941 900 31.5553800 0.0000000
kHAP250 - kHAP500 0.0027820 0.0087941 900 0.3163447 0.9999971
kHAP250 - kHAP1000 0.0079161 0.0087941 900 0.9001605 0.9929625
kHAP250 - kHAP2500 0.0330476 0.0087941 900 3.7579258 0.0056827
kHAP250 - kHAP5000 0.0865965 0.0087941 900 9.8471162 0.0000000
kHAP250 - kHAP10000 0.1851454 0.0087941 900 21.0533593 0.0000000
kHAP250 - kHAP20000 0.2346035 0.0087941 900 26.6773631 0.0000000
kHAP250 - kHAP30000 0.2753483 0.0087941 900 31.3105683 0.0000000
kHAP500 - kHAP1000 0.0051341 0.0087941 900 0.5838159 0.9996807
kHAP500 - kHAP2500 0.0302656 0.0087941 900 3.4415811 0.0174949
kHAP500 - kHAP5000 0.0838146 0.0087941 900 9.5307715 0.0000000
kHAP500 - kHAP10000 0.1823634 0.0087941 900 20.7370146 0.0000000
kHAP500 - kHAP20000 0.2318215 0.0087941 900 26.3610184 0.0000000
kHAP500 - kHAP30000 0.2725664 0.0087941 900 30.9942237 0.0000000
kHAP1000 - kHAP2500 0.0251315 0.0087941 900 2.8577653 0.1006177
kHAP1000 - kHAP5000 0.0786804 0.0087941 900 8.9469557 0.0000000
kHAP1000 - kHAP10000 0.1772293 0.0087941 900 20.1531987 0.0000000
kHAP1000 - kHAP20000 0.2266873 0.0087941 900 25.7772026 0.0000000
kHAP1000 - kHAP30000 0.2674322 0.0087941 900 30.4104078 0.0000000
kHAP2500 - kHAP5000 0.0535490 0.0087941 900 6.0891904 0.0000001
kHAP2500 - kHAP10000 0.1520978 0.0087941 900 17.2954335 0.0000000
kHAP2500 - kHAP20000 0.2015559 0.0087941 900 22.9194373 0.0000000
kHAP2500 - kHAP30000 0.2423007 0.0087941 900 27.5526426 0.0000000
kHAP5000 - kHAP10000 0.0985488 0.0087941 900 11.2062431 0.0000000
kHAP5000 - kHAP20000 0.1480069 0.0087941 900 16.8302469 0.0000000
kHAP5000 - kHAP30000 0.1887518 0.0087941 900 21.4634521 0.0000000
kHAP10000 - kHAP20000 0.0494581 0.0087941 900 5.6240038 0.0000009
kHAP10000 - kHAP30000 0.0902029 0.0087941 900 10.2572091 0.0000000
kHAP20000 - kHAP30000 0.0407449 0.0087941 900 4.6332052 0.0001429
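
With 36 pairwise contrasts per table, it can help to pull out the significant ones programmatically. The Python sketch below parses rows in the same layout as the table above and lists the contrasts whose p-value falls below 0.05. Only the eight contrasts between adjacent k_hap levels are included to keep the example short; the 0.05 threshold and the helper name `parse_contrast` are assumptions for illustration.

```python
# Adjacent-level contrasts copied from the table above:
# contrast, estimate, SE, df, t.ratio, p.value
rows = """\
kHAP100 - kHAP250 0.0021529 0.0087941 900 0.2448117 0.9999996
kHAP250 - kHAP500 0.0027820 0.0087941 900 0.3163447 0.9999971
kHAP500 - kHAP1000 0.0051341 0.0087941 900 0.5838159 0.9996807
kHAP1000 - kHAP2500 0.0251315 0.0087941 900 2.8577653 0.1006177
kHAP2500 - kHAP5000 0.0535490 0.0087941 900 6.0891904 0.0000001
kHAP5000 - kHAP10000 0.0985488 0.0087941 900 11.2062431 0.0000000
kHAP10000 - kHAP20000 0.0494581 0.0087941 900 5.6240038 0.0000009
kHAP20000 - kHAP30000 0.0407449 0.0087941 900 4.6332052 0.0001429
""".splitlines()


def parse_contrast(line):
    """Split one whitespace-separated table row into named fields."""
    parts = line.split()
    return {
        "contrast": " ".join(parts[:3]),  # e.g. "kHAP100 - kHAP250"
        "estimate": float(parts[3]),
        "se": float(parts[4]),
        "df": int(parts[5]),
        "t_ratio": float(parts[6]),
        "p_value": float(parts[7]),
    }


ALPHA = 0.05  # assumed significance threshold
contrasts = [parse_contrast(line) for line in rows]
significant = [c["contrast"] for c in contrasts if c["p_value"] < ALPHA]

for name in significant:
    print(name)
```

For these rows, the steps up to kHAP2500 are not significant, while every step from kHAP2500 upwards is.
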

HaploGrep Haplogrouping (with info > 0.3 cutoff)

Note that, by convention, imputed variants with an IMPUTE2 info score ≤ 0.3 are excluded from final datasets. I have therefore also displayed the results after excluding any imputed sites with an info score ≤ 0.3.
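
The cutoff described above amounts to keeping only imputed sites whose info score is strictly greater than 0.3. A minimal Python sketch of that filter follows; the site positions, info values, and field names are hypothetical and are not taken from the MitoImpute data.

```python
INFO_CUTOFF = 0.3  # sites with info <= 0.3 are excluded

# Hypothetical imputed sites: position and IMPUTE2 info score.
imputed_sites = [
    {"pos": 73, "info": 0.95},
    {"pos": 150, "info": 0.30},   # exactly at the cutoff, so excluded
    {"pos": 263, "info": 0.12},
    {"pos": 750, "info": 0.31},
]

# Keep only sites strictly above the cutoff.
kept_sites = [site for site in imputed_sites if site["info"] > INFO_CUTOFF]

print([site["pos"] for site in kept_sites])
```

Using a strict inequality means a site with an info score of exactly 0.3 is dropped, matching the "info ≤ 0.3 are excluded" convention.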

Imputed haplogroup concordance:
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data are filtered to remove imputed sites with info ≤ 0.3.

Difference in haplogroup concordance between the genotyped and imputed datasets (imputed sites with info ≤ 0.3 excluded):
Boxplot of mean difference in haplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. The imputed data are filtered to remove imputed sites with info ≤ 0.3.

Table showing the residuals for the linear model testing for significant difference in the means of imputed haplogroup concordance
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 0.1359959 0.0169995 1.807766 0.0720361
Residuals 900 8.4632264 0.0094036 NA NA
Table showing the estimated marginal means for the linear model testing for significant difference in the means of imputed haplogroup concordance for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.1921550 0.0096491 900 0.1732177 0.2110923
kHAP250 0.1912446 0.0096491 900 0.1723073 0.2101819
kHAP500 0.1821798 0.0096491 900 0.1632425 0.2011171
kHAP1000 0.1722632 0.0096491 900 0.1533258 0.1912005
kHAP2500 0.1650621 0.0096491 900 0.1461248 0.1839994
kHAP5000 0.1617448 0.0096491 900 0.1428075 0.1806821
kHAP10000 0.1754163 0.0096491 900 0.1564790 0.1943536
kHAP20000 0.1834067 0.0096491 900 0.1644693 0.2023440
kHAP30000 0.2005048 0.0096491 900 0.1815675 0.2194421
Table showing the contrasts for the linear model testing for significant difference in the means of imputed haplogroup concordance for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0009104 0.0136459 900 0.0667155 1.0000000
kHAP100 - kHAP500 0.0099752 0.0136459 900 0.7310074 0.9983468
kHAP100 - kHAP1000 0.0198918 0.0136459 900 1.4577199 0.8746005
kHAP100 - kHAP2500 0.0270929 0.0136459 900 1.9854311 0.5539929
kHAP100 - kHAP5000 0.0304102 0.0136459 900 2.2285276 0.3879072
kHAP100 - kHAP10000 0.0167387 0.0136459 900 1.2266494 0.9505981
kHAP100 - kHAP20000 0.0087483 0.0136459 900 0.6410990 0.9993618
kHAP100 - kHAP30000 -0.0083498 0.0136459 900 -0.6118930 0.9995474
kHAP250 - kHAP500 0.0090648 0.0136459 900 0.6642919 0.9991721
kHAP250 - kHAP1000 0.0189815 0.0136459 900 1.3910044 0.9012870
kHAP250 - kHAP2500 0.0261825 0.0136459 900 1.9187156 0.6008699
kHAP250 - kHAP5000 0.0294998 0.0136459 900 2.1618120 0.4318207
kHAP250 - kHAP10000 0.0158283 0.0136459 900 1.1599339 0.9644418
kHAP250 - kHAP20000 0.0078380 0.0136459 900 0.5743834 0.9997172
kHAP250 - kHAP30000 -0.0092602 0.0136459 900 -0.6786085 0.9990331
kHAP500 - kHAP1000 0.0099166 0.0136459 900 0.7267124 0.9984151
kHAP500 - kHAP2500 0.0171177 0.0136459 900 1.2544236 0.9438398
kHAP500 - kHAP5000 0.0204350 0.0136459 900 1.4975201 0.8568223
kHAP500 - kHAP10000 0.0067635 0.0136459 900 0.4956419 0.9999070
kHAP500 - kHAP20000 -0.0012269 0.0136459 900 -0.0899085 1.0000000
kHAP500 - kHAP30000 -0.0183250 0.0136459 900 -1.3429005 0.9180982
kHAP1000 - kHAP2500 0.0072011 0.0136459 900 0.5277112 0.9998504
kHAP1000 - kHAP5000 0.0105183 0.0136459 900 0.7708077 0.9975902
kHAP1000 - kHAP10000 -0.0031532 0.0136459 900 -0.2310705 0.9999998
kHAP1000 - kHAP20000 -0.0111435 0.0136459 900 -0.8166209 0.9963884
kHAP1000 - kHAP30000 -0.0282417 0.0136459 900 -2.0696129 0.4949611
kHAP2500 - kHAP5000 0.0033173 0.0136459 900 0.2430965 0.9999996
kHAP2500 - kHAP10000 -0.0103542 0.0136459 900 -0.7587817 0.9978440
kHAP2500 - kHAP20000 -0.0183446 0.0136459 900 -1.3443321 0.9176271
kHAP2500 - kHAP30000 -0.0354427 0.0136459 900 -2.5973241 0.1890354
kHAP5000 - kHAP10000 -0.0136715 0.0136459 900 -1.0018782 0.9857282
kHAP5000 - kHAP20000 -0.0216618 0.0136459 900 -1.5874286 0.8117404
kHAP5000 - kHAP30000 -0.0387600 0.0136459 900 -2.8404206 0.1052504
kHAP10000 - kHAP20000 -0.0079903 0.0136459 900 -0.5855504 0.9996736
kHAP10000 - kHAP30000 -0.0250885 0.0136459 900 -1.8385424 0.6563077
kHAP20000 - kHAP30000 -0.0170982 0.0136459 900 -1.2529920 0.9442031
Table showing the residuals for the linear model testing for significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 0.1359959 0.0169995 49.16032 0
Residuals 900 0.3112172 0.0003458 NA NA
Table showing the estimated marginal means for the linear model testing for significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
k_hap emmean SE df lower.CL upper.CL
kHAP100 -0.0147225 0.0018503 900 -0.0183540 -0.0110911
kHAP250 -0.0156329 0.0018503 900 -0.0192644 -0.0120015
kHAP500 -0.0246978 0.0018503 900 -0.0283292 -0.0210663
kHAP1000 -0.0346144 0.0018503 900 -0.0382459 -0.0309829
kHAP2500 -0.0418155 0.0018503 900 -0.0454469 -0.0381840
kHAP5000 -0.0451327 0.0018503 900 -0.0487642 -0.0415013
kHAP10000 -0.0314612 0.0018503 900 -0.0350927 -0.0278298
kHAP20000 -0.0234709 0.0018503 900 -0.0271024 -0.0198394
kHAP30000 -0.0063727 0.0018503 900 -0.0100042 -0.0027413
Table showing the contrasts for the linear model testing for significant difference in the mean concordance of assigned haplogroups between genotyped and imputed data for different Reference Panel Number of included reference haplotypes (k_hap) filtering thresholds
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0009104 0.0026168 900 0.3479069 0.9999939
kHAP100 - kHAP500 0.0099752 0.0026168 900 3.8120447 0.0046344
kHAP100 - kHAP1000 0.0198918 0.0026168 900 7.6016920 0.0000000
kHAP100 - kHAP2500 0.0270929 0.0026168 900 10.3535911 0.0000000
kHAP100 - kHAP5000 0.0304102 0.0026168 900 11.6212863 0.0000000
kHAP100 - kHAP10000 0.0167387 0.0026168 900 6.3967096 0.0000000
kHAP100 - kHAP20000 0.0087483 0.0026168 900 3.3431916 0.0242232
kHAP100 - kHAP30000 -0.0083498 0.0026168 900 -3.1908890 0.0391436
kHAP250 - kHAP500 0.0090648 0.0026168 900 3.4641378 0.0162103
kHAP250 - kHAP1000 0.0189815 0.0026168 900 7.2537851 0.0000000
kHAP250 - kHAP2500 0.0261825 0.0026168 900 10.0056842 0.0000000
kHAP250 - kHAP5000 0.0294998 0.0026168 900 11.2733794 0.0000000
kHAP250 - kHAP10000 0.0158283 0.0026168 900 6.0488026 0.0000001
kHAP250 - kHAP20000 0.0078380 0.0026168 900 2.9952846 0.0693853
kHAP250 - kHAP30000 -0.0092602 0.0026168 900 -3.5387959 0.0125394
kHAP500 - kHAP1000 0.0099166 0.0026168 900 3.7896473 0.0050445
kHAP500 - kHAP2500 0.0171177 0.0026168 900 6.5415464 0.0000000
kHAP500 - kHAP5000 0.0204350 0.0026168 900 7.8092416 0.0000000
kHAP500 - kHAP10000 0.0067635 0.0026168 900 2.5846649 0.1944275
kHAP500 - kHAP20000 -0.0012269 0.0026168 900 -0.4688531 0.9999391
kHAP500 - kHAP30000 -0.0183250 0.0026168 900 -7.0029337 0.0000000
kHAP1000 - kHAP2500 0.0072011 0.0026168 900 2.7518991 0.1315597
kHAP1000 - kHAP5000 0.0105183 0.0026168 900 4.0195944 0.0020571
kHAP1000 - kHAP10000 -0.0031532 0.0026168 900 -1.2049824 0.9554549
kHAP1000 - kHAP20000 -0.0111435 0.0026168 900 -4.2585004 0.0007626
kHAP1000 - kHAP30000 -0.0282417 0.0026168 900 -10.7925810 0.0000000
kHAP2500 - kHAP5000 0.0033173 0.0026168 900 1.2676952 0.9403933
kHAP2500 - kHAP10000 -0.0103542 0.0026168 900 -3.9568815 0.0026424
kHAP2500 - kHAP20000 -0.0183446 0.0026168 900 -7.0103995 0.0000000
kHAP2500 - kHAP30000 -0.0354427 0.0026168 900 -13.5444801 0.0000000
kHAP5000 - kHAP10000 -0.0136715 0.0026168 900 -5.2245768 0.0000077
kHAP5000 - kHAP20000 -0.0216618 0.0026168 900 -8.2780948 0.0000000
kHAP5000 - kHAP30000 -0.0387600 0.0026168 900 -14.8121753 0.0000000
kHAP10000 - kHAP20000 -0.0079903 0.0026168 900 -3.0535180 0.0588237
kHAP10000 - kHAP30000 -0.0250885 0.0026168 900 -9.5875986 0.0000000
kHAP20000 - kHAP30000 -0.0170982 0.0026168 900 -6.5340806 0.0000000

HaploGrep Macrohaplogrouping (with info > 0.3 cutoff)

The same trend can be seen when only macrohaplogroups are considered.

Compare the genotyped macrohaplogroup result shown earlier with the filtered imputed data, which shows a higher macrohaplogroup concordance:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data are filtered to remove imputed sites with info ≤ 0.3.

If the improvement in accurate assignment of haplogroups wasn’t evident from the last two plots, displaying the mean difference should make this clear:
Boxplot of mean difference in macrohaplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. The imputed data are filtered to remove imputed sites with info ≤ 0.3.

These can be statistically tested with linear models:

Analysis-of-variance table for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 0.4348943 0.0543618 2.502299 0.0108598
Residuals 900 19.5522685 0.0217247 NA NA
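As a quick check on how the columns of the table above relate to each other, the following Python sketch recomputes the mean squares and the F value from the reported sums of squares and degrees of freedom. It is only arithmetic on numbers already shown; the variable names are illustrative and are not taken from the analysis code.

```python
# Sums of squares and degrees of freedom copied from the table above.
ss_k_hap, df_k_hap = 0.4348943, 8
ss_resid, df_resid = 19.5522685, 900

ms_k_hap = ss_k_hap / df_k_hap   # mean square for the k_hap term
ms_resid = ss_resid / df_resid   # residual mean square
f_value = ms_k_hap / ms_resid    # F statistic on (8, 900) degrees of freedom

print(round(ms_k_hap, 7), round(ms_resid, 7), round(f_value, 4))
```

The same relationship (Mean Sq = Sum Sq / Df, F = Mean Sq of k_hap / Mean Sq of residuals) holds for every table of this layout in the report.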
Table showing the estimated marginal means for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different numbers of reference haplotypes included in the reference panel (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.8956450 0.0146662 900 0.8668611 0.9244288
kHAP250 0.8937851 0.0146662 900 0.8650012 0.9225690
kHAP500 0.8899990 0.0146662 900 0.8612151 0.9187829
kHAP1000 0.8851149 0.0146662 900 0.8563310 0.9138988
kHAP2500 0.8756515 0.0146662 900 0.8468677 0.9044354
kHAP5000 0.8520048 0.0146662 900 0.8232209 0.8807887
kHAP10000 0.8438504 0.0146662 900 0.8150665 0.8726342
kHAP20000 0.8324138 0.0146662 900 0.8036300 0.8611977
kHAP30000 0.8675830 0.0146662 900 0.8387992 0.8963669
Table showing the contrasts for the linear model testing for a significant difference in the means of imputed macrohaplogroup concordance for different numbers of reference haplotypes included in the reference panel (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0018599 0.0207411 900 0.0896700 1.0000000
kHAP100 - kHAP500 0.0056460 0.0207411 900 0.2722126 0.9999991
kHAP100 - kHAP1000 0.0105301 0.0207411 900 0.5076906 0.9998883
kHAP100 - kHAP2500 0.0199934 0.0207411 900 0.9639529 0.9889003
kHAP100 - kHAP5000 0.0436402 0.0207411 900 2.1040433 0.4711207
kHAP100 - kHAP10000 0.0517946 0.0207411 900 2.4971974 0.2345817
kHAP100 - kHAP20000 0.0632311 0.0207411 900 3.0485928 0.0596615
kHAP100 - kHAP30000 0.0280619 0.0207411 900 1.3529626 0.9147490
kHAP250 - kHAP500 0.0037861 0.0207411 900 0.1825426 1.0000000
kHAP250 - kHAP1000 0.0086702 0.0207411 900 0.4180206 0.9999748
kHAP250 - kHAP2500 0.0181336 0.0207411 900 0.8742828 0.9942242
kHAP250 - kHAP5000 0.0417803 0.0207411 900 2.0143733 0.5336276
kHAP250 - kHAP10000 0.0499347 0.0207411 900 2.4075274 0.2809498
kHAP250 - kHAP20000 0.0613713 0.0207411 900 2.9589228 0.0767433
kHAP250 - kHAP30000 0.0262021 0.0207411 900 1.2632925 0.9415524
kHAP500 - kHAP1000 0.0048841 0.0207411 900 0.2354780 0.9999997
kHAP500 - kHAP2500 0.0143474 0.0207411 900 0.6917403 0.9988889
kHAP500 - kHAP5000 0.0379942 0.0207411 900 1.8318307 0.6608708
kHAP500 - kHAP10000 0.0461486 0.0207411 900 2.2249848 0.3901923
kHAP500 - kHAP20000 0.0575852 0.0207411 900 2.7763802 0.1238253
kHAP500 - kHAP30000 0.0224159 0.0207411 900 1.0807500 0.9769141
kHAP1000 - kHAP2500 0.0094634 0.0207411 900 0.4562622 0.9999505
kHAP1000 - kHAP5000 0.0331101 0.0207411 900 1.5963527 0.8069109
kHAP1000 - kHAP10000 0.0412645 0.0207411 900 1.9895068 0.5511235
kHAP1000 - kHAP20000 0.0527011 0.0207411 900 2.5409022 0.2138847
kHAP1000 - kHAP30000 0.0175319 0.0207411 900 0.8452719 0.9954155
kHAP2500 - kHAP5000 0.0236467 0.0207411 900 1.1400904 0.9679479
kHAP2500 - kHAP10000 0.0318012 0.0207411 900 1.5332446 0.8397080
kHAP2500 - kHAP20000 0.0432377 0.0207411 900 2.0846399 0.4845247
kHAP2500 - kHAP30000 0.0080685 0.0207411 900 0.3890097 0.9999855
kHAP5000 - kHAP10000 0.0081544 0.0207411 900 0.3931541 0.9999843
kHAP5000 - kHAP20000 0.0195910 0.0207411 900 0.9445495 0.9902930
kHAP5000 - kHAP30000 -0.0155782 0.0207411 900 -0.7510807 0.9979946
kHAP10000 - kHAP20000 0.0114365 0.0207411 900 0.5513954 0.9997918
kHAP10000 - kHAP30000 -0.0237327 0.0207411 900 -1.1442349 0.9672376
kHAP20000 - kHAP30000 -0.0351692 0.0207411 900 -1.6956302 0.7493002
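The standard errors and t ratios in the two tables above follow from the residual mean square of the model. The sketch below reproduces them in Python under one assumption that this report does not state explicitly: that the design is balanced, with the 909 observations (900 residual df + 8 k_hap df + 1) split evenly into 101 per k_hap level. The variable names are illustrative.

```python
import math

ms_resid = 0.0217247        # residual mean square from the ANOVA table
n_total = 900 + 8 + 1       # residual df + k_hap df + 1
n_per_level = n_total / 9   # ASSUMPTION: balanced design, 101 rows per k_hap level

se_emmean = math.sqrt(ms_resid / n_per_level)        # SE of each marginal mean
se_contrast = math.sqrt(2 * ms_resid / n_per_level)  # SE of a pairwise difference

# kHAP100 - kHAP20000, using the estimated marginal means from the table
estimate = 0.8956450 - 0.8324138
t_ratio = estimate / se_contrast

print(round(se_emmean, 7), round(se_contrast, 7), round(t_ratio, 3))
```

The p.value column cannot be recovered from the t ratio alone. A t ratio near 3.05 on 900 degrees of freedom would have an unadjusted two-sided p-value well below 0.01, so the reported 0.0597 presumably reflects an adjustment for multiple comparisons, although the adjustment method is not stated here.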
Analysis-of-variance table for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 0.4348943 0.0543618 23.71659 0
Residuals 900 2.0629282 0.0022921 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different numbers of reference haplotypes included in the reference panel (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.0135152 0.0047639 900 0.0041656 0.0228648
kHAP250 0.0116553 0.0047639 900 0.0023058 0.0210049
kHAP500 0.0078692 0.0047639 900 -0.0014804 0.0172188
kHAP1000 0.0029851 0.0047639 900 -0.0063644 0.0123347
kHAP2500 -0.0064782 0.0047639 900 -0.0158278 0.0028714
kHAP5000 -0.0301250 0.0047639 900 -0.0394745 -0.0207754
kHAP10000 -0.0382794 0.0047639 900 -0.0476290 -0.0289298
kHAP20000 -0.0497159 0.0047639 900 -0.0590655 -0.0403664
kHAP30000 -0.0145467 0.0047639 900 -0.0238963 -0.0051971
Table showing the contrasts for the linear model testing for a significant difference in the mean concordance of assigned macrohaplogroups between genotyped and imputed data for different numbers of reference haplotypes included in the reference panel (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0018599 0.0067371 900 0.2760602 0.9999990
kHAP100 - kHAP500 0.0056460 0.0067371 900 0.8380400 0.9956790
kHAP100 - kHAP1000 0.0105301 0.0067371 900 1.5629881 0.8246482
kHAP100 - kHAP2500 0.0199934 0.0067371 900 2.9676476 0.0749219
kHAP100 - kHAP5000 0.0436402 0.0067371 900 6.4775563 0.0000000
kHAP100 - kHAP10000 0.0517946 0.0067371 900 7.6879297 0.0000000
kHAP100 - kHAP20000 0.0632311 0.0067371 900 9.3854682 0.0000000
kHAP100 - kHAP30000 0.0280619 0.0067371 900 4.1652618 0.0011314
kHAP250 - kHAP500 0.0037861 0.0067371 900 0.5619798 0.9997598
kHAP250 - kHAP1000 0.0086702 0.0067371 900 1.2869279 0.9351435
kHAP250 - kHAP2500 0.0181336 0.0067371 900 2.6915873 0.1521771
kHAP250 - kHAP5000 0.0417803 0.0067371 900 6.2014961 0.0000000
kHAP250 - kHAP10000 0.0499347 0.0067371 900 7.4118695 0.0000000
kHAP250 - kHAP20000 0.0613713 0.0067371 900 9.1094079 0.0000000
kHAP250 - kHAP30000 0.0262021 0.0067371 900 3.8892016 0.0034456
kHAP500 - kHAP1000 0.0048841 0.0067371 900 0.7249481 0.9984425
kHAP500 - kHAP2500 0.0143474 0.0067371 900 2.1296076 0.4536061
kHAP500 - kHAP5000 0.0379942 0.0067371 900 5.6395163 0.0000008
kHAP500 - kHAP10000 0.0461486 0.0067371 900 6.8498897 0.0000000
kHAP500 - kHAP20000 0.0575852 0.0067371 900 8.5474282 0.0000000
kHAP500 - kHAP30000 0.0224159 0.0067371 900 3.3272218 0.0255082
kHAP1000 - kHAP2500 0.0094634 0.0067371 900 1.4046595 0.8961444
kHAP1000 - kHAP5000 0.0331101 0.0067371 900 4.9145682 0.0000371
kHAP1000 - kHAP10000 0.0412645 0.0067371 900 6.1249416 0.0000000
kHAP1000 - kHAP20000 0.0527011 0.0067371 900 7.8224801 0.0000000
kHAP1000 - kHAP30000 0.0175319 0.0067371 900 2.6022737 0.1869558
kHAP2500 - kHAP5000 0.0236467 0.0067371 900 3.5099088 0.0138601
kHAP2500 - kHAP10000 0.0318012 0.0067371 900 4.7202821 0.0000949
kHAP2500 - kHAP20000 0.0432377 0.0067371 900 6.4178206 0.0000000
kHAP2500 - kHAP30000 0.0080685 0.0067371 900 1.1976143 0.9570258
kHAP5000 - kHAP10000 0.0081544 0.0067371 900 1.2103734 0.9542798
kHAP5000 - kHAP20000 0.0195910 0.0067371 900 2.9079118 0.0881307
kHAP5000 - kHAP30000 -0.0155782 0.0067371 900 -2.3122945 0.3356545
kHAP10000 - kHAP20000 0.0114365 0.0067371 900 1.6975385 0.7481284
kHAP10000 - kHAP30000 -0.0237327 0.0067371 900 -3.5226678 0.0132621
kHAP20000 - kHAP30000 -0.0351692 0.0067371 900 -5.2202063 0.0000079

These results suggest that there is a statistically significant difference in the accurate assignment of haplogroups between reference panels with different numbers of included reference haplotypes (k_hap). However, the improvement is tiny, so its biological and practical significance seems small.

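For macrohaplogroups, the overall effect of k_hap on imputed concordance is statistically significant (F = 2.50, p = 0.011), but none of the pairwise contrasts between k_hap levels is significant (smallest reported p = 0.060). The difference in concordance between the genotyped and imputed data does vary significantly with k_hap (F = 23.72), although every estimated mean difference is smaller than 0.05 in absolute value. It should be noted that both the genotyped and imputed datasets allow HaploGrep to call macrohaplogroups accurately, with mean concordance of roughly 83% to 90%.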

There is a slight increase in the ability to accurately call haplogroups when a filter of info > 0.3 is applied, but the biological and practical significance of the improvement again seems small.

HaploGrep haplogroup quality comparisons

We also examined the difference in HaploGrep’s quality score between the truth set, the genotyped set, and the imputed set.

Here I show the difference between the truth set and the genotyped set:
Boxplot of mean HaploGrep quality score between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep.

Here I show the difference between the truth set and the imputed set:
Boxplot of mean HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. No info score filter was applied to the imputed data.

Here I show the difference between the truth set and the imputed set with the info score filter info > 0.3:
Boxplot of mean HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed data points with info ≤ 0.3 were removed.

Relative to the truth set, the quality score still appears to be lower.

However, I have also investigated the difference between the genotyped and imputed datasets to see if there is any improvement. I have only investigated the imputed dataset filtered with info > 0.3.
Boxplot of the mean difference in HaploGrep quality score between the genotyped and the imputed data. Haplogroups were assigned using HaploGrep. Imputed data points with info ≤ 0.3 were removed.

On average, there is a decrease in HaploGrep quality score.

HaploGrep string distance (Damerau-Levenshtein)

We also examined the distance between the strings of the assigned haplogroups, as measures of haplogroup concordance may be misleading if only one sub-haplogroup is incorrectly assigned. We used a few different measures, as different measures of distance will provide different results. All results are between the genotyped dataset and the imputed dataset with an info filter of info > 0.3.
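To make the measure concrete, here is a minimal Python sketch of the optimal string alignment distance, the restricted variant of Damerau-Levenshtein in which a swap of two adjacent characters counts as a single edit. This report does not say which implementation or variant produced the results, so the function is an illustration only, and the label `J12c` is a made-up example of a transposed string rather than a real assignment.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, counted as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("H1a1", "H1a"))   # one missing sub-haplogroup character
print(osa_distance("J1c2", "J12c"))  # one adjacent transposition
```

A strict concordance measure scores `H1a1` against `H1a` as a plain mismatch, whereas the string distance records that the two labels differ by a single character.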

This result shows the Damerau-Levenshtein distance:
Boxplot of mean Damerau-Levenshtein string distance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed data points with info ≤ 0.3 were removed.

Analysis-of-variance table for the linear model testing for a significant difference in the Damerau-Levenshtein string distance between assigned haplogroups
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 7.701951 0.9627438 20.58027 0
Residuals 900 42.101947 0.0467799 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean Damerau-Levenshtein string distance between assigned haplogroups for different numbers of reference haplotypes included in the reference panel (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.2656427 0.0215213 900 0.2234049 0.3078805
kHAP250 0.2860894 0.0215213 900 0.2438515 0.3283272
kHAP500 0.3890456 0.0215213 900 0.3468078 0.4312835
kHAP1000 0.4004665 0.0215213 900 0.3582287 0.4427044
kHAP2500 0.4276142 0.0215213 900 0.3853763 0.4698520
kHAP5000 0.3446826 0.0215213 900 0.3024448 0.3869205
kHAP10000 0.2656466 0.0215213 900 0.2234088 0.3078844
kHAP20000 0.2287035 0.0215213 900 0.1864657 0.2709413
kHAP30000 0.1199958 0.0215213 900 0.0777579 0.1622336
Table showing the contrasts for the linear model testing for a significant difference in the mean Damerau-Levenshtein string distance between assigned haplogroups for different numbers of reference haplotypes included in the reference panel (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 -0.0204467 0.0304358 900 -0.6717978 0.9991015
kHAP100 - kHAP500 -0.1234029 0.0304358 900 -4.0545368 0.0017859
kHAP100 - kHAP1000 -0.1348238 0.0304358 900 -4.4297834 0.0003610
kHAP100 - kHAP2500 -0.1619714 0.0304358 900 -5.3217481 0.0000046
kHAP100 - kHAP5000 -0.0790399 0.0304358 900 -2.5969422 0.1891965
kHAP100 - kHAP10000 -0.0000039 0.0304358 900 -0.0001284 1.0000000
kHAP100 - kHAP20000 0.0369392 0.0304358 900 1.2136780 0.9535486
kHAP100 - kHAP30000 0.1456469 0.0304358 900 4.7853882 0.0000696
kHAP250 - kHAP500 -0.1029562 0.0304358 900 -3.3827390 0.0212838
kHAP250 - kHAP1000 -0.1143771 0.0304358 900 -3.7579855 0.0056815
kHAP250 - kHAP2500 -0.1415248 0.0304358 900 -4.6499503 0.0001322
kHAP250 - kHAP5000 -0.0585932 0.0304358 900 -1.9251444 0.5963696
kHAP250 - kHAP10000 0.0204428 0.0304358 900 0.6716695 0.9991027
kHAP250 - kHAP20000 0.0573859 0.0304358 900 1.8854758 0.6240314
kHAP250 - kHAP30000 0.1660936 0.0304358 900 5.4571860 0.0000022
kHAP500 - kHAP1000 -0.0114209 0.0304358 900 -0.3752465 0.9999891
kHAP500 - kHAP2500 -0.0385685 0.0304358 900 -1.2672113 0.9405215
kHAP500 - kHAP5000 0.0443630 0.0304358 900 1.4575946 0.8746543
kHAP500 - kHAP10000 0.1233990 0.0304358 900 4.0544085 0.0017869
kHAP500 - kHAP20000 0.1603421 0.0304358 900 5.2682148 0.0000061
kHAP500 - kHAP30000 0.2690498 0.0304358 900 8.8399250 0.0000000
kHAP1000 - kHAP2500 -0.0271476 0.0304358 900 -0.8919647 0.9933839
kHAP1000 - kHAP5000 0.0557839 0.0304358 900 1.8328412 0.6601847
kHAP1000 - kHAP10000 0.1348199 0.0304358 900 4.4296550 0.0003612
kHAP1000 - kHAP20000 0.1717630 0.0304358 900 5.6434613 0.0000008
kHAP1000 - kHAP30000 0.2804707 0.0304358 900 9.2151716 0.0000000
kHAP2500 - kHAP5000 0.0829315 0.0304358 900 2.7248059 0.1405427
kHAP2500 - kHAP10000 0.1619675 0.0304358 900 5.3216197 0.0000046
kHAP2500 - kHAP20000 0.1989107 0.0304358 900 6.5354261 0.0000000
kHAP2500 - kHAP30000 0.3076184 0.0304358 900 10.1071363 0.0000000
kHAP5000 - kHAP10000 0.0790360 0.0304358 900 2.5968138 0.1892507
kHAP5000 - kHAP20000 0.1159791 0.0304358 900 3.8106202 0.0046595
kHAP5000 - kHAP30000 0.2246868 0.0304358 900 7.3823304 0.0000000
kHAP10000 - kHAP20000 0.0369431 0.0304358 900 1.2138063 0.9535201
kHAP10000 - kHAP30000 0.1456508 0.0304358 900 4.7855166 0.0000695
kHAP20000 - kHAP30000 0.1087077 0.0304358 900 3.5717102 0.0111737

HaploGrep string distance (Levenshtein)

We also examined the distance between the strings of the assigned haplogroups, as measures of haplogroup concordance may be misleading if only one sub-haplogroup is incorrectly assigned. We used a few different measures, as different measures of distance will provide different results. All results are between the genotyped dataset and the imputed dataset with an info filter of info > 0.3.
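For reference, the plain Levenshtein distance counts only single-character insertions, deletions and substitutions, so a swap of two adjacent characters costs two edits. The Python sketch below is a standard two-row dynamic programming version written for illustration; it is not the code used to produce these results, and `J12c` is a made-up transposed label.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,     # deletion
                               current[j - 1] + 1,  # insertion
                               substitution))       # substitution or match
        previous = current
    return previous[-1]


print(levenshtein("H1a1", "H1a"))   # one deleted character
print(levenshtein("J1c2", "J12c"))  # a transposition costs two edits here
```

The estimates in this section are nearly identical to those in the Damerau-Levenshtein section, which would be consistent with adjacent transpositions being rare among the assigned haplogroup labels.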

This result shows the Levenshtein distance:
Boxplot of mean Levenshtein string distance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed data points with info ≤ 0.3 were removed.

Analysis-of-variance table for the linear model testing for a significant difference in the Levenshtein string distance between assigned haplogroups
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 7.691524 0.9614406 20.52634 0
Residuals 900 42.155417 0.0468394 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean Levenshtein string distance between assigned haplogroups for different numbers of reference haplotypes included in the reference panel (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.2655997 0.021535 900 0.2233351 0.3078644
kHAP250 0.2860308 0.021535 900 0.2437661 0.3282954
kHAP500 0.3889792 0.021535 900 0.3467145 0.4312438
kHAP1000 0.4004079 0.021535 900 0.3581433 0.4426726
kHAP2500 0.4276259 0.021535 900 0.3853612 0.4698905
kHAP5000 0.3446787 0.021535 900 0.3024141 0.3869434
kHAP10000 0.2659045 0.021535 900 0.2236398 0.3081691
kHAP20000 0.2288402 0.021535 900 0.1865756 0.2711049
kHAP30000 0.1201286 0.021535 900 0.0778640 0.1623933
Table showing the contrasts for the linear model testing for a significant difference in the mean Levenshtein string distance between assigned haplogroups for different numbers of reference haplotypes included in the reference panel (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 -0.0204310 0.0304551 900 -0.6708585 0.9991106
kHAP100 - kHAP500 -0.1233795 0.0304551 900 -4.0511949 0.0018104
kHAP100 - kHAP1000 -0.1348082 0.0304551 900 -4.4264599 0.0003664
kHAP100 - kHAP2500 -0.1620261 0.0304551 900 -5.3201681 0.0000047
kHAP100 - kHAP5000 -0.0790790 0.0304551 900 -2.5965777 0.1893504
kHAP100 - kHAP10000 -0.0003048 0.0304551 900 -0.0100071 1.0000000
kHAP100 - kHAP20000 0.0367595 0.0304551 900 1.2070064 0.9550163
kHAP100 - kHAP30000 0.1454711 0.0304551 900 4.7765790 0.0000726
kHAP250 - kHAP500 -0.1029484 0.0304551 900 -3.3803364 0.0214529
kHAP250 - kHAP1000 -0.1143771 0.0304551 900 -3.7556015 0.0057323
kHAP250 - kHAP2500 -0.1415951 0.0304551 900 -4.6493096 0.0001326
kHAP250 - kHAP5000 -0.0586479 0.0304551 900 -1.9257192 0.5959670
kHAP250 - kHAP10000 0.0201263 0.0304551 900 0.6608514 0.9992029
kHAP250 - kHAP20000 0.0571905 0.0304551 900 1.8778649 0.6293040
kHAP250 - kHAP30000 0.1659021 0.0304551 900 5.4474375 0.0000024
kHAP500 - kHAP1000 -0.0114287 0.0304551 900 -0.3752651 0.9999891
kHAP500 - kHAP2500 -0.0386467 0.0304551 900 -1.2689733 0.9400538
kHAP500 - kHAP5000 0.0443005 0.0304551 900 1.4546172 0.8759285
kHAP500 - kHAP10000 0.1230747 0.0304551 900 4.0411878 0.0018853
kHAP500 - kHAP20000 0.1601389 0.0304551 900 5.2582013 0.0000065
kHAP500 - kHAP30000 0.2688506 0.0304551 900 8.8277739 0.0000000
kHAP1000 - kHAP2500 -0.0272180 0.0304551 900 -0.8937082 0.9932959
kHAP1000 - kHAP5000 0.0557292 0.0304551 900 1.8298823 0.6621927
kHAP1000 - kHAP10000 0.1345034 0.0304551 900 4.4164529 0.0003830
kHAP1000 - kHAP20000 0.1715677 0.0304551 900 5.6334663 0.0000008
kHAP1000 - kHAP30000 0.2802793 0.0304551 900 9.2030390 0.0000000
kHAP2500 - kHAP5000 0.0829472 0.0304551 900 2.7235905 0.1409562
kHAP2500 - kHAP10000 0.1617214 0.0304551 900 5.3101611 0.0000049
kHAP2500 - kHAP20000 0.1987856 0.0304551 900 6.5271745 0.0000000
kHAP2500 - kHAP30000 0.3074972 0.0304551 900 10.0967472 0.0000000
kHAP5000 - kHAP10000 0.0787742 0.0304551 900 2.5865706 0.1936090
kHAP5000 - kHAP20000 0.1158385 0.0304551 900 3.8035841 0.0047856
kHAP5000 - kHAP30000 0.2245501 0.0304551 900 7.3731567 0.0000000
kHAP10000 - kHAP20000 0.0370642 0.0304551 900 1.2170135 0.9528022
kHAP10000 - kHAP30000 0.1457759 0.0304551 900 4.7865861 0.0000692
kHAP20000 - kHAP30000 0.1087116 0.0304551 900 3.5695726 0.0112581

HaploGrep string distance (Jaccard)

We also examined the distance between the strings of the assigned haplogroups, as measures of haplogroup concordance may be misleading if only one sub-haplogroup is incorrectly assigned. We used a few different measures, as different measures of distance will provide different results. All results are between the genotyped dataset and the imputed dataset with an info filter of info > 0.3.
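Unlike the edit distances, the Jaccard distance compares the sets of q-grams (substrings of length q) in the two strings, so it ignores character order and repeated characters. The Python sketch below assumes single-character q-grams (q = 1); the q used for these results is not stated in this report, and the function names are illustrative.

```python
def qgrams(s: str, q: int = 1) -> set:
    """Set of all substrings of length q in s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(a: str, b: str, q: int = 1) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the q-gram sets of a and b."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 0.0  # convention chosen here: two empty strings are identical
    return 1.0 - len(grams_a & grams_b) / len(grams_a | grams_b)


print(jaccard_distance("H1a1", "H1b1"))  # shares H and 1, differs in a vs b
print(jaccard_distance("H1a1", "H1a"))   # same character set, so distance 0
```

Under this assumption `H1a1` and `H1a` are at distance 0 even though the labels differ, so Jaccard values are not directly comparable with the edit distances in the previous sections.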

This result shows the Jaccard distance:
Boxplot of mean Jaccard string distance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed data points with info ≤ 0.3 were removed.

Analysis-of-variance table for the linear model testing for a significant difference in the Jaccard string distance between assigned haplogroups
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 0.2274493 0.0284312 46.83543 0
Residuals 900 0.5463395 0.0006070 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean Jaccard string distance between assigned haplogroups for different numbers of reference haplotypes included in the reference panel (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.0141382 0.0024516 900 0.0093267 0.0189497
kHAP250 0.0167605 0.0024516 900 0.0119490 0.0215720
kHAP500 0.0259972 0.0024516 900 0.0211857 0.0308087
kHAP1000 0.0369160 0.0024516 900 0.0321044 0.0417275
kHAP2500 0.0447019 0.0024516 900 0.0398904 0.0495134
kHAP5000 0.0586427 0.0024516 900 0.0538312 0.0634542
kHAP10000 0.0494711 0.0024516 900 0.0446596 0.0542826
kHAP20000 0.0430226 0.0024516 900 0.0382111 0.0478341
kHAP30000 0.0128689 0.0024516 900 0.0080574 0.0176804
Table showing the contrasts for the linear model testing for a significant difference in the mean Jaccard string distance between assigned haplogroups for different numbers of reference haplotypes included in the reference panel (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 -0.0026223 0.0034671 900 -0.7563317 0.9978929
kHAP100 - kHAP500 -0.0118590 0.0034671 900 -3.4204517 0.0187800
kHAP100 - kHAP1000 -0.0227777 0.0034671 900 -6.5697133 0.0000000
kHAP100 - kHAP2500 -0.0305637 0.0034671 900 -8.8153859 0.0000000
kHAP100 - kHAP5000 -0.0445045 0.0034671 900 -12.8362952 0.0000000
kHAP100 - kHAP10000 -0.0353329 0.0034671 900 -10.1909453 0.0000000
kHAP100 - kHAP20000 -0.0288844 0.0034671 900 -8.3310394 0.0000000
kHAP100 - kHAP30000 0.0012693 0.0034671 900 0.3661078 0.9999910
kHAP250 - kHAP500 -0.0092367 0.0034671 900 -2.6641200 0.1623240
kHAP250 - kHAP1000 -0.0201555 0.0034671 900 -5.8133816 0.0000003
kHAP250 - kHAP2500 -0.0279414 0.0034671 900 -8.0590542 0.0000000
kHAP250 - kHAP5000 -0.0418822 0.0034671 900 -12.0799635 0.0000000
kHAP250 - kHAP10000 -0.0327106 0.0034671 900 -9.4346136 0.0000000
kHAP250 - kHAP20000 -0.0262621 0.0034671 900 -7.5747077 0.0000000
kHAP250 - kHAP30000 0.0038916 0.0034671 900 1.1224395 0.9708462
kHAP500 - kHAP1000 -0.0109188 0.0034671 900 -3.1492616 0.0444011
kHAP500 - kHAP2500 -0.0187047 0.0034671 900 -5.3949342 0.0000031
kHAP500 - kHAP5000 -0.0326455 0.0034671 900 -9.4158435 0.0000000
kHAP500 - kHAP10000 -0.0234739 0.0034671 900 -6.7704937 0.0000000
kHAP500 - kHAP20000 -0.0170254 0.0034671 900 -4.9105877 0.0000378
kHAP500 - kHAP30000 0.0131283 0.0034671 900 3.7865595 0.0051036
kHAP1000 - kHAP2500 -0.0077859 0.0034671 900 -2.2456726 0.3769298
kHAP1000 - kHAP5000 -0.0217268 0.0034671 900 -6.2665819 0.0000000
kHAP1000 - kHAP10000 -0.0125551 0.0034671 900 -3.6212320 0.0093715
kHAP1000 - kHAP20000 -0.0061067 0.0034671 900 -1.7613261 0.7077634
kHAP1000 - kHAP30000 0.0240471 0.0034671 900 6.9358211 0.0000000
kHAP2500 - kHAP5000 -0.0139408 0.0034671 900 -4.0209093 0.0020462
kHAP2500 - kHAP10000 -0.0047692 0.0034671 900 -1.3755595 0.9069056
kHAP2500 - kHAP20000 0.0016793 0.0034671 900 0.4843465 0.9999219
kHAP2500 - kHAP30000 0.0318330 0.0034671 900 9.1814937 0.0000000
kHAP5000 - kHAP10000 0.0091716 0.0034671 900 2.6453498 0.1695366
kHAP5000 - kHAP20000 0.0156201 0.0034671 900 4.5052558 0.0002572
kHAP5000 - kHAP30000 0.0457738 0.0034671 900 13.2024030 0.0000000
kHAP10000 - kHAP20000 0.0064485 0.0034671 900 1.8599059 0.6416899
kHAP10000 - kHAP30000 0.0366022 0.0034671 900 10.5570532 0.0000000
kHAP20000 - kHAP30000 0.0301537 0.0034671 900 8.6971472 0.0000000

Matthews Correlation Coefficient (MCC)

We also determined imputation accuracy using the Matthews correlation coefficient (MCC). The MCC is a more direct measure of the imputation accuracy of genotypes (as opposed to haplogroups).
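For reference, the MCC of two binary vectors is (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The Python sketch below assumes genotypes coded as 0 (reference allele) and 1 (alternate allele) at biallelic sites; how the genotypes were actually coded and compared for these results is not described in this report, and the toy vectors are invented for illustration.

```python
import math


def matthews_corrcoef(truth, imputed):
    """MCC for two equal-length sequences of 0/1 genotype calls."""
    pairs = list(zip(truth, imputed))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denominator


truth_calls = [1, 1, 0, 0, 1, 0]    # hypothetical truth genotypes
imputed_calls = [1, 0, 0, 0, 1, 1]  # hypothetical imputed genotypes
print(matthews_corrcoef(truth_calls, imputed_calls))
```

An MCC of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and −1 indicates complete disagreement.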

Boxplot of mean Matthews correlation coefficient between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. Imputed data points with info ≤ 0.3 were removed.

Analysis-of-variance table for the linear model testing for a significant difference in the Matthews correlation coefficient
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 2.017058 0.2521323 24.0245 0
Residuals 900 9.445317 0.0104948 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean Matthews correlation coefficient for different numbers of reference haplotypes included in the reference panel (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.8381114 0.0101936 900 0.8181054 0.8581173
kHAP250 0.8491270 0.0101936 900 0.8291211 0.8691330
kHAP500 0.8629537 0.0101936 900 0.8429477 0.8829596
kHAP1000 0.8793764 0.0101936 900 0.8593705 0.8993824
kHAP2500 0.8585136 0.0101936 900 0.8385077 0.8785196
kHAP5000 0.8618736 0.0101936 900 0.8418676 0.8818795
kHAP10000 0.9071607 0.0101936 900 0.8871547 0.9271666
kHAP20000 0.9660753 0.0101936 900 0.9460694 0.9860813
kHAP30000 0.9731845 0.0101936 900 0.9531786 0.9931905
Table showing the contrasts for the linear model testing for a significant difference in the mean Matthews correlation coefficient for different numbers of reference haplotypes included in the reference panel (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 -0.0110157 0.0144159 900 -0.7641330 0.9977339
kHAP100 - kHAP500 -0.0248423 0.0144159 900 -1.7232577 0.7321242
kHAP100 - kHAP1000 -0.0412650 0.0144159 900 -2.8624702 0.0993893
kHAP100 - kHAP2500 -0.0204022 0.0144159 900 -1.4152609 0.8920384
kHAP100 - kHAP5000 -0.0237622 0.0144159 900 -1.6483338 0.7775986
kHAP100 - kHAP10000 -0.0690493 0.0144159 900 -4.7898051 0.0000681
kHAP100 - kHAP20000 -0.1279640 0.0144159 900 -8.8765937 0.0000000
kHAP100 - kHAP30000 -0.1350732 0.0144159 900 -9.3697441 0.0000000
kHAP250 - kHAP500 -0.0138266 0.0144159 900 -0.9591247 0.9892607
kHAP250 - kHAP1000 -0.0302494 0.0144159 900 -2.0983372 0.4750534
kHAP250 - kHAP2500 -0.0093866 0.0144159 900 -0.6511279 0.9992848
kHAP250 - kHAP5000 -0.0127465 0.0144159 900 -0.8842008 0.9937642
kHAP250 - kHAP10000 -0.0580336 0.0144159 900 -4.0256720 0.0020073
kHAP250 - kHAP20000 -0.1169483 0.0144159 900 -8.1124606 0.0000000
kHAP250 - kHAP30000 -0.1240575 0.0144159 900 -8.6056111 0.0000000
kHAP500 - kHAP1000 -0.0164228 0.0144159 900 -1.1392125 0.9680969
kHAP500 - kHAP2500 0.0044400 0.0144159 900 0.3079968 0.9999977
kHAP500 - kHAP5000 0.0010801 0.0144159 900 0.0749239 1.0000000
kHAP500 - kHAP10000 -0.0442070 0.0144159 900 -3.0665473 0.0566548
kHAP500 - kHAP20000 -0.1031217 0.0144159 900 -7.1533359 0.0000000
kHAP500 - kHAP30000 -0.1102309 0.0144159 900 -7.6464864 0.0000000
kHAP1000 - kHAP2500 0.0208628 0.0144159 900 1.4472093 0.8790649
kHAP1000 - kHAP5000 0.0175029 0.0144159 900 1.2141364 0.9534465
kHAP1000 - kHAP10000 -0.0277842 0.0144159 900 -1.9273348 0.5948350
kHAP1000 - kHAP20000 -0.0866989 0.0144159 900 -6.0141234 0.0000001
kHAP1000 - kHAP30000 -0.0938081 0.0144159 900 -6.5072739 0.0000000
kHAP2500 - kHAP5000 -0.0033600 0.0144159 900 -0.2330729 0.9999997
kHAP2500 - kHAP10000 -0.0486470 0.0144159 900 -3.3745441 0.0218655
kHAP2500 - kHAP20000 -0.1075617 0.0144159 900 -7.4613327 0.0000000
kHAP2500 - kHAP30000 -0.1146709 0.0144159 900 -7.9544832 0.0000000
kHAP5000 - kHAP10000 -0.0452871 0.0144159 900 -3.1414712 0.0454495
kHAP5000 - kHAP20000 -0.1042018 0.0144159 900 -7.2282598 0.0000000
kHAP5000 - kHAP30000 -0.1113110 0.0144159 900 -7.7214103 0.0000000
kHAP10000 - kHAP20000 -0.0589147 0.0144159 900 -4.0867886 0.0015657
kHAP10000 - kHAP30000 -0.0660239 0.0144159 900 -4.5799391 0.0001829
kHAP20000 - kHAP30000 -0.0071092 0.0144159 900 -0.4931505 0.9999105

IMPUTE2 INFO Score

We are also reporting IMPUTE2’s INFO score. Here I will plot INFO scores for both the raw imputed data and the imputed data after info score filtering.

Boxplot of mean info score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. No info score filter was applied to the imputed data.

Analysis-of-variance table for the linear model testing for a significant difference in the IMPUTE2 INFO score
Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 15.01398 1.8767479 90.61791 0
Residuals 900 18.63951 0.0207106 NA NA
Table showing the estimated marginal means for the linear model testing for a significant difference in the mean IMPUTE2 INFO score for different numbers of reference haplotypes included in the reference panel (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.7569623 0.0143197 900 0.7288583 0.7850662
kHAP250 0.7423139 0.0143197 900 0.7142099 0.7704178
kHAP500 0.7253970 0.0143197 900 0.6972930 0.7535009
kHAP1000 0.6927940 0.0143197 900 0.6646900 0.7208980
kHAP2500 0.6200950 0.0143197 900 0.5919910 0.6481990
kHAP5000 0.5426906 0.0143197 900 0.5145866 0.5707946
kHAP10000 0.4798140 0.0143197 900 0.4517100 0.5079180
kHAP20000 0.4357204 0.0143197 900 0.4076164 0.4638244
kHAP30000 0.4147448 0.0143197 900 0.3866408 0.4428488
Table showing the contrasts for the linear model testing for a significant difference in the mean IMPUTE2 INFO score for different numbers of reference haplotypes included in the reference panel (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0146484 0.0202512 900 0.7233364 0.9984671
kHAP100 - kHAP500 0.0315653 0.0202512 900 1.5586901 0.8268690
kHAP100 - kHAP1000 0.0641683 0.0202512 900 3.1686202 0.0418853
kHAP100 - kHAP2500 0.1368673 0.0202512 900 6.7584842 0.0000000
kHAP100 - kHAP5000 0.2142717 0.0202512 900 10.5807031 0.0000000
kHAP100 - kHAP10000 0.2771483 0.0202512 900 13.6855383 0.0000000
kHAP100 - kHAP20000 0.3212419 0.0202512 900 15.8628736 0.0000000
kHAP100 - kHAP30000 0.3422175 0.0202512 900 16.8986456 0.0000000
kHAP250 - kHAP500 0.0169169 0.0202512 900 0.8353537 0.9957737
kHAP250 - kHAP1000 0.0495199 0.0202512 900 2.4452838 0.2607954
kHAP250 - kHAP2500 0.1222188 0.0202512 900 6.0351478 0.0000001
kHAP250 - kHAP5000 0.1996233 0.0202512 900 9.8573667 0.0000000
kHAP250 - kHAP10000 0.2624998 0.0202512 900 12.9622019 0.0000000
kHAP250 - kHAP20000 0.3065934 0.0202512 900 15.1395372 0.0000000
kHAP250 - kHAP30000 0.3275690 0.0202512 900 16.1753092 0.0000000
kHAP500 - kHAP1000 0.0326030 0.0202512 900 1.6099301 0.7994464
kHAP500 - kHAP2500 0.1053019 0.0202512 900 5.1997941 0.0000088
kHAP500 - kHAP5000 0.1827064 0.0202512 900 9.0220130 0.0000000
kHAP500 - kHAP10000 0.2455829 0.0202512 900 12.1268482 0.0000000
kHAP500 - kHAP20000 0.2896765 0.0202512 900 14.3041835 0.0000000
kHAP500 - kHAP30000 0.3106521 0.0202512 900 15.3399556 0.0000000
kHAP1000 - kHAP2500 0.0726990 0.0202512 900 3.5898640 0.0104795
kHAP1000 - kHAP5000 0.1501034 0.0202512 900 7.4120829 0.0000000
kHAP1000 - kHAP10000 0.2129800 0.0202512 900 10.5169181 0.0000000
kHAP1000 - kHAP20000 0.2570736 0.0202512 900 12.6942534 0.0000000
kHAP1000 - kHAP30000 0.2780492 0.0202512 900 13.7300255 0.0000000
kHAP2500 - kHAP5000 0.0774044 0.0202512 900 3.8222189 0.0044584
kHAP2500 - kHAP10000 0.1402810 0.0202512 900 6.9270541 0.0000000
kHAP2500 - kHAP20000 0.1843746 0.0202512 900 9.1043894 0.0000000
kHAP2500 - kHAP30000 0.2053502 0.0202512 900 10.1401615 0.0000000
kHAP5000 - kHAP10000 0.0628766 0.0202512 900 3.1048352 0.0506674
kHAP5000 - kHAP20000 0.1069702 0.0202512 900 5.2821705 0.0000057
kHAP5000 - kHAP30000 0.1279458 0.0202512 900 6.3179425 0.0000000
kHAP10000 - kHAP20000 0.0440936 0.0202512 900 2.1773353 0.4214463
kHAP10000 - kHAP30000 0.0650692 0.0202512 900 3.2131074 0.0365637
kHAP20000 - kHAP30000 0.0209756 0.0202512 900 1.0357721 0.9823348
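Each row of the contrasts table can be reproduced from the estimated marginal means: the estimate is the difference between two means, and the t ratio is the estimate divided by its standard error. The sketch below is a hedged illustration in Python, assuming equal group sizes and a pooled residual variance (so that the SE of a difference is √2 times the SE of a single mean); the function name `contrast` is hypothetical.

```python
import math

# Estimated marginal means and their shared SE, copied from the table of means.
emmeans = {"kHAP100": 0.7569623, "kHAP250": 0.7423139, "kHAP1000": 0.6927940}
se_emmean = 0.0143197


def contrast(a, b):
    """Return (estimate, SE, t ratio) for the pairwise contrast a - b.

    Assumes equal group sizes and a pooled residual variance, so the SE of
    a difference is sqrt(2) times the SE of a single mean.
    """
    estimate = emmeans[a] - emmeans[b]
    se = math.sqrt(2) * se_emmean
    return estimate, se, estimate / se


for a, b in [("kHAP100", "kHAP250"), ("kHAP100", "kHAP1000")]:
    est, se, t = contrast(a, b)
    print(f"{a} - {b}: estimate={est:.7f} SE={se:.7f} t.ratio={t:.4f}")
```

The p-values are not reproduced here. They appear to be adjusted for multiple comparisons (the emmeans package applies the Tukey method by default for pairwise contrasts), which would require the studentized range distribution rather than a plain t distribution.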
Boxplot of mean info score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. These data include a filter that removes imputed data points with info ≤ 0.3

Table showing the analysis of variance for the linear model testing for a significant difference in IMPUTE2 INFO score (following filtering to info > 0.3) across settings of the number of included reference haplotypes (k_hap)
Term Df Sum Sq Mean Sq F value Pr(>F)
k_hap 8 1.687115 0.2108894 73.78828 0
Residuals 900 2.572230 0.0028580 NA NA
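The F value in the table above follows directly from the sums of squares and degrees of freedom. The short Python sketch below is intended only as an arithmetic check on the reported numbers; the variable names are illustrative.

```python
# Values copied from the k_hap and Residuals rows of the table above.
df_khap, ss_khap = 8, 1.687115
df_resid, ss_resid = 900, 2.572230

# Mean squares are sums of squares divided by their degrees of freedom.
ms_khap = ss_khap / df_khap
ms_resid = ss_resid / df_resid

# The F statistic is the ratio of the two mean squares.
f_value = ms_khap / ms_resid

print(f"Mean Sq (k_hap)     = {ms_khap:.7f}")
print(f"Mean Sq (Residuals) = {ms_resid:.7f}")
print(f"F value             = {f_value:.4f}")
```

The result agrees with the reported F value of 73.78828 to the precision shown. With 8 and 900 degrees of freedom, an F statistic of this size has a p-value far below the displayed precision, which is why Pr(>F) is printed as 0.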
Table showing the estimated marginal means for the linear model testing for a significant difference in mean IMPUTE2 INFO score (following filtering to info > 0.3) across settings of the number of reference haplotypes included from the reference panel (k_hap)
k_hap emmean SE df lower.CL upper.CL
kHAP100 0.8443088 0.0053195 900 0.8338687 0.8547489
kHAP250 0.8430381 0.0053195 900 0.8325980 0.8534782
kHAP500 0.8422536 0.0053195 900 0.8318135 0.8526937
kHAP1000 0.8340947 0.0053195 900 0.8236545 0.8445348
kHAP2500 0.8172612 0.0053195 900 0.8068210 0.8277013
kHAP5000 0.8196562 0.0053195 900 0.8092161 0.8300963
kHAP10000 0.8720843 0.0053195 900 0.8616441 0.8825244
kHAP20000 0.9221291 0.0053195 900 0.9116890 0.9325692
kHAP30000 0.9478903 0.0053195 900 0.9374502 0.9583304
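The common standard error in the table above can be related to the residual mean square of the fitted model. The Python sketch below assumes a one-way model with k_hap as the only predictor and a balanced design (the same number of observations at every k_hap level); both are assumptions inferred from the reported degrees of freedom, not statements from the original analysis.

```python
import math

# Residual mean square and residual degrees of freedom, copied from the
# model table for the filtered data (info > 0.3).
ms_resid = 0.0028580
df_resid = 900
n_levels = 9  # number of k_hap levels listed in the table above

# Assumption: one-way model, so residual df = total observations - levels.
n_total = df_resid + n_levels
# Assumption: balanced design, so observations are split evenly over levels.
n_per_level = n_total // n_levels

# SE of a group mean under a pooled residual variance.
se_emmean = math.sqrt(ms_resid / n_per_level)

print(f"observations per level = {n_per_level}")
print(f"SE of each mean        = {se_emmean:.7f}")
```

Under these assumptions there are 101 observations per k_hap level, and the computed standard error matches the reported value of 0.0053195, which suggests that the balanced one-way reading of the model is reasonable.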
Table showing the pairwise contrasts for the linear model testing for a significant difference in mean IMPUTE2 INFO score (following filtering to info > 0.3) across settings of the number of reference haplotypes included from the reference panel (k_hap)
contrast estimate SE df t.ratio p.value
kHAP100 - kHAP250 0.0012707 0.0075229 900 0.1689062 1.0000000
kHAP100 - kHAP500 0.0020552 0.0075229 900 0.2731930 0.9999991
kHAP100 - kHAP1000 0.0102141 0.0075229 900 1.3577307 0.9131311
kHAP100 - kHAP2500 0.0270476 0.0075229 900 3.5953510 0.0102775
kHAP100 - kHAP5000 0.0246526 0.0075229 900 3.2769847 0.0299501
kHAP100 - kHAP10000 -0.0277755 0.0075229 900 -3.6921014 0.0072500
kHAP100 - kHAP20000 -0.0778203 0.0075229 900 -10.3443921 0.0000000
kHAP100 - kHAP30000 -0.1035815 0.0075229 900 -13.7687410 0.0000000
kHAP250 - kHAP500 0.0007845 0.0075229 900 0.1042868 1.0000000
kHAP250 - kHAP1000 0.0089435 0.0075229 900 1.1888245 0.9588473
kHAP250 - kHAP2500 0.0257770 0.0075229 900 3.4264448 0.0184072
kHAP250 - kHAP5000 0.0233819 0.0075229 900 3.1080785 0.0501858
kHAP250 - kHAP10000 -0.0290462 0.0075229 900 -3.8610076 0.0038427
kHAP250 - kHAP20000 -0.0790910 0.0075229 900 -10.5132982 0.0000000
kHAP250 - kHAP30000 -0.1048522 0.0075229 900 -13.9376472 0.0000000
kHAP500 - kHAP1000 0.0081589 0.0075229 900 1.0845377 0.9764060
kHAP500 - kHAP2500 0.0249924 0.0075229 900 3.3221580 0.0259281
kHAP500 - kHAP5000 0.0225974 0.0075229 900 3.0037917 0.0677508
kHAP500 - kHAP10000 -0.0298307 0.0075229 900 -3.9652944 0.0025558
kHAP500 - kHAP20000 -0.0798755 0.0075229 900 -10.6175851 0.0000000
kHAP500 - kHAP30000 -0.1056367 0.0075229 900 -14.0419340 0.0000000
kHAP1000 - kHAP2500 0.0168335 0.0075229 900 2.2376203 0.3820684
kHAP1000 - kHAP5000 0.0144384 0.0075229 900 1.9192540 0.6004931
kHAP1000 - kHAP10000 -0.0379896 0.0075229 900 -5.0498321 0.0000189
kHAP1000 - kHAP20000 -0.0880344 0.0075229 900 -11.7021228 0.0000000
kHAP1000 - kHAP30000 -0.1137956 0.0075229 900 -15.1264717 0.0000000
kHAP2500 - kHAP5000 -0.0023951 0.0075229 900 -0.3183663 0.9999970
kHAP2500 - kHAP10000 -0.0548231 0.0075229 900 -7.2874524 0.0000000
kHAP2500 - kHAP20000 -0.1048679 0.0075229 900 -13.9397431 0.0000000
kHAP2500 - kHAP30000 -0.1306291 0.0075229 900 -17.3640920 0.0000000
kHAP5000 - kHAP10000 -0.0524281 0.0075229 900 -6.9690861 0.0000000
kHAP5000 - kHAP20000 -0.1024729 0.0075229 900 -13.6213768 0.0000000
kHAP5000 - kHAP30000 -0.1282341 0.0075229 900 -17.0457257 0.0000000
kHAP10000 - kHAP20000 -0.0500448 0.0075229 900 -6.6522906 0.0000000
kHAP10000 - kHAP30000 -0.0758060 0.0075229 900 -10.0766396 0.0000000
kHAP20000 - kHAP30000 -0.0257612 0.0075229 900 -3.4243490 0.0185368